information visualization
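Bibliometrics has two main research directions. The first is performance analysis, which uses bibliographic data to evaluate the performance and impact of groups such as countries, universities, and researchers (Noyons, Moed, & van Raan, 1999; van Raan, 2005a). The second is science mapping, which uses science maps to present the structure and dynamics of scientific research (Börner, Chen, & Boyack, 2003; Noyons, Moed, & Luwel, 1999). Science mapping usually builds on the co-occurrence relations of objects such as authors, references, or terms in papers. The objects are clustered so that objects within a cluster are more strongly related to each other than to objects in other clusters, and the relations among the resulting clusters then reveal the structure of scientific research. In general, author clusters derived from co-occurrence in papers represent the social structure of scientific research. References represent its intellectual base, and their clusters and the relations among them present the corresponding intellectual structure. Term clusters represent research themes and present the corresponding cognitive structure. When the period under analysis is divided into several subperiods, themes in different subperiods can be related to one another, meaning that they share a considerable number of terms. To analyze the evolution of these continuing themes, this study calls the sequence they form a thematic area. Earlier work on performance analysis concentrated on the performance of groups and seldom evaluated, in a conceptual way, the productivity and impact of specific themes and thematic areas within a research field. In other words, the two directions have rarely been integrated.
To that end, this study carries out science mapping and performance analysis in the following steps: 1) estimate the relatedness between terms with the association strength (van Eck & Waltman, 2007) and cluster them with the simple centers algorithm (Coulter et al., 1998) to identify the themes of each subperiod; 2) map the clustering results onto a strategic diagram (Callon et al., 1991), using the density of each network and the strength of its links to other networks, to understand each theme's internal cohesion and external connectedness; 3) analyze the thematic areas that themes form over consecutive subperiods, identifying the origin and evolution of each thematic area; 4) measure the bibliometric performance, such as productivity and impact, of each theme and thematic area.
This study uses Fuzzy Sets Theory (FST) as a case study for science mapping and performance analysis and carries out the analysis described above. The results include: 1) Combining several bibliometric tools to analyze the evolution of a research field's cognitive structure reveals important knowledge such as the basic themes of FST in each stage and the related thematic areas; for example, the thematic area FUZZY-CONTROL shows a growing trend, while the thematic area FUZZY-LOGIC is declining. 2) Visualization tools make it easier to detect the evolution, importance, and future direction of these themes and thematic areas. 3) Bibliometric indicators such as the h-index (Alonso et al., 2009; Cabrerizo et al., 2010; Hirsch, 2005) allow a better analysis of the quality and impact of each theme and thematic area.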
In bibliometrics, there are two main procedures: performance analysis and science mapping (Noyons, Moed, & Luwel, 1999; van Raan, 2005a). Performance analysis aims at evaluating groups of scientific actors (countries, universities, departments, researchers) and the impact of their activity (Noyons, Moed, & van Raan, 1999; van Raan, 2005a) on the basis of bibliographic data. Science mapping aims at displaying the structural and dynamic aspects of scientific research (Börner, Chen, & Boyack, 2003; Noyons, Moed, & Luwel, 1999). A science map is used to represent the cognitive structure of a research field.
The majority of these methods are mainly focused on measuring the performance of the scientific actors and little research has been carried out in order to measure the performance of given research fields in a conceptual way (specific themes or whole thematic areas). A performance analysis of specific themes or whole thematic areas can measure (quantitatively and qualitatively) the relative contribution of these themes and thematic areas to the whole research field, detecting the most prominent, productive, and highest-impact subfields.
In the case of co-citation analysis, the clusters represent groups of references that can be understood as the intellectual base of the different subfields.
On the other hand, in the case of co-word analysis, the clusters represent groups of textual information that can be understood as semantic or conceptual groups of different topics treated by the research field.
So, the detected clusters can be used for several purposes, such as:
• To analyze their evolution through measuring continuance across consecutive subperiods.
• To quantify the research field by means of a performance analysis.
Cluster string (Small, 2006; Small & Upham, 2009; Upham & Small, 2010), rolling clustering (Kandylas et al., 2010) and alluvial diagrams (Rosvall & Bergstrom, 2010) have been used to show the evolution of detected clusters in successive time periods. Other authors proposed laying out the graph of a given time period taking into account previous and subsequent ones (Leydesdorff & Schank, 2008), or packing synthesized temporal changes into a single graph (Chen, 2004; Chen et al., 2010).
Strategic diagrams (Callon et al., 1991), self-organizing maps (Polanco, François, & Lamirel, 2001), heliocentric maps (Moya-Anegón et al., 2005), geometrical models (Skupin, 2009) and thematic networks (Bailón-Moreno, Jurado-Alameda, Ruiz-Baños, & Courtial, 2005; López-Herrera et al., 2009) have been proposed to show and lay out the research field and its detected subfields.
To sum up, the stages carried out by our approach are:
1. To detect the themes treated by the research field by means of co-word analysis for each studied subperiod.
2. To lay out in a low dimensional space the results of the first step (themes).
3. To analyze the evolution of the detected themes through the different subperiods studied, in order to detect the main general thematic areas of the research field, their origins and their inter-relationships.
4. To carry out a performance analysis of the different periods, themes and thematic areas, by means of quantitative and impact measures.
In our proposal, the process is divided into five steps: (1) collection of raw data, (2) selection of the type of item to analyze, (3) extraction of relevant information from the raw data, (4) calculation of similarities between items based on the extracted information and (5) use of a clustering algorithm to detect the themes.
Similarities between items are calculated based on frequencies of keywords’ co-occurrences. Different similarity measures have been used in the literature, the most popular being Salton’s Cosine and the Jaccard index. In van Eck and Waltman (2009) an analysis of well-known direct similarity measures was made, concluding that the most appropriate measure for normalizing co-occurrence frequencies is the equivalence index (Callon et al., 1991; Michelet, 1988). This measure is also known as association strength (Coulter et al., 1998; van Eck & Waltman, 2007), proximity index (Peters & van Raan, 1993; Rip & Courtial, 1984), or probabilistic affinity index (Zitt, Bassecoulard, & Okubo, 2000).
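As a minimal sketch of the normalization described above, the following Python computes the equivalence index in the form given by Callon et al. (1991), e_ij = c_ij² / (c_i · c_j), where c_ij is the number of documents in which keywords i and j co-occur and c_i and c_j are the numbers of documents containing each keyword. The function name and the toy keyword lists are illustrative assumptions, not data from the study.

```python
from collections import Counter
from itertools import combinations


def equivalence_index(documents):
    """Return {frozenset({a, b}): e_ab} for every co-occurring keyword pair.

    `documents` is a list of keyword lists, one list per document.
    """
    occurrences = Counter()    # c_i: documents containing keyword i
    cooccurrences = Counter()  # c_ij: documents containing both i and j
    for keywords in documents:
        unique = sorted(set(keywords))  # count each keyword once per document
        occurrences.update(unique)
        cooccurrences.update(frozenset(p) for p in combinations(unique, 2))

    strength = {}
    for pair, c_ij in cooccurrences.items():
        a, b = tuple(pair)
        strength[pair] = c_ij ** 2 / (occurrences[a] * occurrences[b])
    return strength


# Hypothetical toy corpus, for illustration only.
toy_documents = [
    ["fuzzy", "control"],
    ["fuzzy", "control", "pid"],
    ["fuzzy", "logic"],
]
toy_strength = equivalence_index(toy_documents)
```

The index equals 1 when two keywords always appear together and approaches 0 when they rarely do, which is why it can serve as a link weight for the clustering step.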
Different clustering algorithms can be used to create a partition of the keywords network or graph. Recently, some authors have proposed different clustering algorithms to carry out this task: Streemer (Kandylas et al., 2010), spectral clustering (Chen et al., 2010), modularity maximization (Chen & Redner, 2010) and bootstrap resampling with significance clustering (Rosvall & Bergstrom, 2010).
As is described in Coulter et al. (1998), the simple centers algorithm uses two passes through the data to produce the desired networks. The first pass (Pass-1) constructs the networks depicting the strongest associations, and links added in this pass are called internal links. The second pass (Pass-2) adds to these networks links of weaker strengths that form associations between networks. The links added during the second pass are called external links.
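The two-pass idea can be illustrated with a deliberately simplified greedy sketch. It is not the algorithm as specified in Coulter et al. (1998): the parameters `max_size` and `min_strength`, the tie-breaking rule, and the treatment of every within-network link as internal are all assumptions made here for illustration.

```python
def pass_one(strength, max_size=3, min_strength=0.2):
    """Greedy Pass-1: grow disjoint keyword networks from the strongest links.

    `strength` maps frozenset({a, b}) to a link weight.
    """
    ranked = sorted(strength.items(), key=lambda kv: (-kv[1], sorted(kv[0])))
    assigned, networks = set(), []
    for pair, value in ranked:
        if value < min_strength:
            break                      # remaining links are too weak
        if pair & assigned:
            continue                   # an endpoint already has a network
        network = set(pair)            # seed a network with this link
        while len(network) < max_size:
            candidate = None
            for other, weight in ranked:
                if weight < min_strength:
                    break
                outside = other - network
                # strongest link joining the network to one free keyword
                if len(outside) == 1 and not (outside & assigned):
                    candidate = outside
                    break
            if candidate is None:
                break
            network |= candidate
        assigned |= network
        networks.append(network)
    return networks


def pass_two(strength, networks, min_strength=0.2):
    """Pass-2 (simplified): split the links into internal and external ones."""
    owner = {kw: idx for idx, net in enumerate(networks) for kw in net}
    internal, external = {}, {}
    for pair, value in strength.items():
        a, b = tuple(pair)
        if value < min_strength or a not in owner or b not in owner:
            continue
        target = internal if owner[a] == owner[b] else external
        target[pair] = value
    return internal, external


# Hypothetical link weights, for illustration only.
toy_links = {
    frozenset({"fuzzy", "control"}): 0.9,
    frozenset({"logic", "sets"}): 0.8,
    frozenset({"control", "pid"}): 0.6,
    frozenset({"fuzzy", "logic"}): 0.5,
    frozenset({"pid", "sets"}): 0.1,
}
toy_networks = pass_one(toy_links)
toy_internal, toy_external = pass_two(toy_links, toy_networks)
```

In this toy run the first pass groups the keywords into two networks, and the second pass keeps the weaker "fuzzy"/"logic" link as the only external link between them.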
Callon’s centrality, to be referred to as centrality henceforth, measures the degree of interaction of a network with other networks (Callon et al., 1991) ... Centrality measures the strength of external ties to other themes. We can understand this value as a measure of the importance of a theme in the development of the entire research field analyzed.
Callon’s density, to be referred to as density henceforth, measures the internal strength of the network (Callon et al., 1991) ... Density measures the strength of internal ties among all keywords describing the research theme. This value can be understood as a measure of the theme’s development.
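The excerpts above elide the formulas for both measures. A minimal sketch, assuming the commonly cited forms c = 10 × Σ e_kh (k a keyword of the theme, h a keyword of another theme) and d = 100 × (Σ e_ij / w) (i and j keywords of the theme, w the number of keywords in the theme), could look as follows. The scaling constants and the toy link weights are assumptions here and should be checked against Callon et al. (1991).

```python
def callon_centrality(theme, strength):
    """Sum of external link weights, scaled by 10 (assumed form).

    A link is external when exactly one endpoint lies in `theme`. This
    sketch assumes every keyword outside the theme belongs to another theme.
    """
    external = sum(w for pair, w in strength.items() if len(pair & theme) == 1)
    return 10 * external


def callon_density(theme, strength):
    """Sum of internal link weights over the theme size, scaled by 100 (assumed form)."""
    internal = sum(w for pair, w in strength.items() if pair <= theme)
    return 100 * internal / len(theme)


# Hypothetical link weights (e.g. equivalence indices), for illustration only.
toy_links = {
    frozenset({"fuzzy", "control"}): 0.9,
    frozenset({"control", "pid"}): 0.6,
    frozenset({"fuzzy", "logic"}): 0.5,
    frozenset({"pid", "sets"}): 0.1,
    frozenset({"logic", "sets"}): 0.8,
}
toy_theme = {"fuzzy", "control", "pid"}
toy_centrality = callon_centrality(toy_theme, toy_links)
toy_density = callon_density(toy_theme, toy_links)
```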
We can find four kinds of themes (Cahlik, 2000; Callon et al., 1991; Courtial & Michelet, 1994; Coulter et al., 1998; He, 1999) according to the quadrant in which they are placed:
• Themes in the upper-right quadrant are both well developed and important for the structuring of a research field. They are known as the motor-themes of the specialty, given that they present strong centrality and high density. The placement of themes in this quadrant implies that they are related externally to concepts applicable to other themes that are conceptually closely related.
• Themes in the upper-left quadrant have well developed internal ties but unimportant external ties and so are of only marginal importance for the field. These themes are very specialized and peripheral in character.
• Themes in the lower-left quadrant are both weakly developed and marginal. The themes of this quadrant have low density and low centrality, mainly representing either emerging or disappearing themes.
• Themes in the lower-right quadrant are important for a research field but are not developed. So, this quadrant groups transversal and general, basic themes.
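The four quadrants above can be sketched as a small classifier. It assumes that centrality is the horizontal axis, density the vertical axis, and that the axes are split at the median of each measure; the split rule, the label strings, and the toy values are assumptions for illustration.

```python
from statistics import median


def classify_themes(themes):
    """Map each theme name to a quadrant label.

    `themes` maps a theme name to a (centrality, density) tuple.
    """
    c_split = median(c for c, _ in themes.values())
    d_split = median(d for _, d in themes.values())
    labels = {}
    for name, (centrality, density) in themes.items():
        central = centrality >= c_split
        dense = density >= d_split
        if central and dense:
            labels[name] = "motor"                    # upper-right
        elif dense:
            labels[name] = "specialized/peripheral"   # upper-left
        elif central:
            labels[name] = "basic/transversal"        # lower-right
        else:
            labels[name] = "emerging/disappearing"    # lower-left
    return labels


# Hypothetical (centrality, density) values, for illustration only.
toy_themes = {
    "A": (8.0, 9.0),
    "B": (2.0, 7.0),
    "C": (1.0, 1.0),
    "D": (9.0, 2.0),
}
toy_labels = classify_themes(toy_themes)
```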
In a theme, the keywords and their interconnections draw a network graph, called a thematic network. Each thematic network is labelled using the name of the most significant keyword in the associated theme (usually identified by the most central keyword of the theme).
Given a thematic network, a document is called a “core document” if it has at least two keywords presented in the thematic network. If a document has only one keyword associated with the thematic network, it is called a “secondary document”. Both core and secondary documents can belong to more than one thematic network.
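The core/secondary rule translates directly into code. In this minimal sketch the function name, the return values, and the toy keywords are illustrative assumptions.

```python
def document_role(doc_keywords, theme_keywords):
    """Return "core", "secondary", or None for a document and a thematic network."""
    shared = len(set(doc_keywords) & set(theme_keywords))
    if shared >= 2:
        return "core"        # at least two keywords in the thematic network
    if shared == 1:
        return "secondary"   # exactly one keyword in the thematic network
    return None              # the document is not associated with the theme


# Hypothetical thematic network, for illustration only.
toy_theme = {"fuzzy", "control", "pid"}
```

Because the rule is applied per thematic network, the same document can be a core document for one theme and a secondary document for another.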
So, a thematic area is defined as a group of evolved themes across different subperiods. Note that, depending on the interconnections among them, one theme could belong to a different thematic area, or could not come from any.
As the themes have an associated set of documents (core documents, or secondary documents, or core documents + secondary documents), the thematic areas could also have an associated collection of documents. In this case, the documents associated with each thematic area will be ascertained through the union of the documents associated with the set of themes belonging to each thematic area.
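The union described above can be sketched as follows, assuming each theme is identified by a (subperiod, label) pair and already has a set of document identifiers attached. All names and identifiers here are hypothetical.

```python
def thematic_area_documents(theme_documents, area_themes):
    """Union of the document sets of all themes that belong to a thematic area."""
    documents = set()
    for theme in area_themes:
        documents |= theme_documents.get(theme, set())
    return documents


# Hypothetical themes and document identifiers, for illustration only.
toy_theme_documents = {
    ("period-1", "FUZZY-CONTROL"): {"d1", "d2"},
    ("period-2", "FUZZY-CONTROL"): {"d2", "d3", "d4"},
    ("period-2", "FUZZY-LOGIC"): {"d5"},
}
toy_area = [("period-1", "FUZZY-CONTROL"), ("period-2", "FUZZY-CONTROL")]
toy_area_documents = thematic_area_documents(toy_theme_documents, toy_area)
```

Using a set union means a document attached to several themes of the same area is counted only once in the area's totals.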
By means of quantitative measures the productivity of the detected themes and thematic areas is analyzed, whereas qualitative measures show the (supposed) quality based on the bibliometric impact of those themes and thematic areas.
• Quantitative measures: number of documents, authors, journals and countries.
• Qualitative or impact measures: number of received citations of the documents and bibliometric indices such as the h-index (Alonso et al., 2009; Cabrerizo et al., 2010; Hirsch, 2005).
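For the impact measures, the h-index of a theme or thematic area can be computed from the citation counts of its associated documents: it is the largest h such that h documents have at least h citations each (Hirsch, 2005). The function below is a minimal sketch, and the citation counts in the usage lines are made up.

```python
def h_index(citations):
    """Largest h such that h documents have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for the documents of one theme.
toy_h = h_index([10, 8, 5, 4, 3])
```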
This approach combines different bibliometric tools to analyze the evolution of the cognitive structure of a research field, allowing us to discover important knowledge related to its themes and thematic areas.
In such a way, as was pointed out in Section 4.1, we discover that our approach adequately identifies the FST basic themes in each subperiod, because they achieve the highest citation scores and impacts. Additionally, as was shown in Section 4.2, we are able to identify thematic areas (see Table 7) and show their evolutionary behaviour, as with FUZZY-CONTROL whose evolution is increasing or FUZZY-LOGIC whose evolution is decreasing.
This approach is supported by different visualization tools that allow us to easily detect the themes and thematic areas and to understand their evolution, importance and likely future tendencies.
For example, we show the evolution of the FST research field in Fig. 11 and we identify that FUZZY-CONTROL is the most important thematic area with the highest impact, as is shown in Table 7. We have also concluded that FUZZY ROUGH-SETS seems to be the origin of a new thematic area.
This approach is completed by incorporating a more elaborate bibliometric index, i.e., the h-index, which allows us to better analyze the quality or impact of the themes and thematic areas. In our FST analysis, as is shown in Sections 4.1 and 4.2, we use the h-index to evaluate the impact of themes and thematic areas.