Wednesday, March 27, 2013

Gmür, M. (2003). Co-citation analysis and the search for invisible colleges: A methodological evaluation. Scientometrics, 57(1), 27-57.


information visualization

Co-citation is the phenomenon of two references appearing together in the bibliography of the same paper; the number of co-citations can be taken as the degree to which the contents of the two references are related. A network map represents references as nodes and, depending on the strength of the co-citation relation between the corresponding references, decides whether two nodes are linked. Clusters are then identified from the pattern of links, for example fully interconnected groups, or star-shaped groups in which several nodes all link to one central node. From the references in each cluster one can determine the topic the cluster represents, and from the structure among the topics one can understand the discipline as a whole.

Taking organization science as its example, this paper discusses how different ways of computing co-citation strength, combined with clustering, affect the topics represented by the clusters in the network map and the structure among those topics. The measures are: 1) raw co-citation counts; 2) co-citation counts relative to the mean of the citation counts; 3) co-citation counts relative to the minimum of the citation counts; 4) co-citation counts relative to both the mean and the minimum of the citation counts (the CoCit-Score); 5) Pearson correlation coefficients based on co-citation counts; and 6) factor analysis based on co-citation counts. In analyzing the clustering results of each method, besides describing the clusters directly, the paper evaluates them with quantitative measures such as cluster size, cluster density, centralization, differentiation within the network, and penetration of the complete network.

Among the methods, co-citation counts relative to the citation mean, the CoCit-Score, and the Pearson correlation coefficient mostly yield clusters that grow around a fully interconnected core; they are close to one another on the evaluation measures, and the topics their clusters represent are also similar. Co-citation counts relative to the citation minimum, in contrast, usually yield clusters centered on a single highly cited reference, and therefore show higher centralization.

This paper summarizes the present state of co-citation analysis and presents several methods of clustering references.
The database used is a selection of 2,114 documents in the field of organization studies from 1986-2000. ... For the calculation of co-citation networks, the references were reduced to those cited in at least 2% of the evaluated articles, i.e. at least 42 times. This resulted in a data set of 15,761 citations distributed between 194 different references.
Co-citation analysis enables the identification of groups of scientists and their publications, and for conclusions to be drawn about the inner structure of research disciplines, schools or paradigms (SMALL, 1980).
A cocitation is taken to exist if two references or authors appear in the same bibliography. It is interpreted as the measure for similarity of content of the two references or authors. The number of co-citations determines the proximity of any two publications in terms of content.
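The counting step described here can be sketched in a few lines (function and variable names are my own, not from the paper):

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(bibliographies):
    """Count how often each pair of references appears in the same bibliography."""
    counts = Counter()
    for bib in bibliographies:
        # Every unordered pair of references in one bibliography is a co-citation.
        for pair in combinations(sorted(set(bib)), 2):
            counts[pair] += 1
    return counts

# Three citing articles: R1 and R2 are co-cited twice, R1 and R3 once.
bibs = [["R1", "R2"], ["R1", "R2", "R3"], ["R3", "R4"]]
cocit = cocitation_counts(bibs)
print(cocit[("R1", "R2")])  # 2
```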
To date, studies have been focused on biomedicine, which comprises the largest area of publication within the SCI (e.g., SMALL & GRIFFITH, 1974; MULLINS et al., 1977; SMALL, 1977; 1980; SMALL & GREENLEE, 1980; 1989; MCCAIN, 1989; 1991), as well as on information science and sociology of knowledge (WHITE & GRIFFITH, 1981; 1982; KÄRKI, 1996; PERSSON, 1994; WHITE & MCCAIN, 1998).
In most cases, documents or authors are chosen on the basis of their frequency of citation within a delineated ISI database, a criterion that meets the basic premises of co-citation analysis.
As SMALL & SWEENEY (1984) have shown in their comparison of methods, it can be an advantage to define citations not in absolute terms but in their relation to the length of the citing document’s bibliography.
Several alternatives for the weighting of co-citation counts have been developed:
1. The highest absolute co-citation counts are assessed for cluster formation, up to a specified limit.
2. Co-citations are measured in relation to the co-citation partners’ citation counts.
3. Co-citations are converted into Pearson’s correlation coefficients, which also achieves a standardization effect.
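These three alternatives can be illustrated as follows; the exact normalizations vary between studies, so the functions below are hypothetical forms of each idea rather than Gmür's literal formulas:

```python
import math

def top_cocitations(cocit, limit):
    # Alternative 1: keep only the highest absolute co-citation counts.
    return sorted(cocit.items(), key=lambda kv: kv[1], reverse=True)[:limit]

def relative_to_citations(c_ab, cit_a, cit_b):
    # Alternative 2: co-citations relative to the partners' citation counts
    # (here: their mean).
    return c_ab / ((cit_a + cit_b) / 2)

def pearson_from_counts(c_ab, cit_a, cit_b, n_articles):
    # Alternative 3: Pearson's r between the two 0/1 citation vectors over
    # n_articles citing documents, computed from the counts alone
    # (the phi coefficient).
    p_a, p_b, p_ab = cit_a / n_articles, cit_b / n_articles, c_ab / n_articles
    return (p_ab - p_a * p_b) / math.sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))

print(relative_to_citations(5, 10, 20))      # ≈ 0.33
print(pearson_from_counts(10, 10, 10, 100))  # ≈ 1.0: always cited together
```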
The macro approach focuses on the overall structure of disciplines, and, ultimately, on the question of which laws govern the evolution of science.
The micro approach aims to describe retrospectively the structure and historical development of individual disciplines or schools of research and their interdependencies. For pragmatic reasons, an author-centred approach dominates here, and scientists tend to rely upon procedures of correlation and factor analysis for their studies. With this approach, a discipline’s structures and lines of development are drawn along its most prominent representatives.
Here, ‘cluster’ is the name given to a group of references with multiple connections to each other, defined by the connection rules for the co-citation network. ... In the following analysis, clusters are interpreted as a self-contained community only if they feature at least one completely interconnected group of three references, or a group of five references with star-shaped connections.
- Cluster size is defined as the number of references that comprise a cluster.
- The in-degree is the sum of internal relationships, i.e., of all co-citations between the references in the cluster yielded by the method of calculation.
- Cluster density is expressed as a percentage and calculated as the quotient of the cluster’s actual in-degree and its maximum possible in-degree.
- Centralization is a measure for the dominance of a single reference within a cluster. Centralization is calculated as the quotient of the co-citation sum of the most-connected reference within the cluster and the mean co-citation sum of all references.
- The out-degree is the sum of all external relationships, i.e., of all co-citations between internal cluster references and those external to it.
- Differentiation within the network serves as an indicator as to how well clusters may be distinguished from one another. Differentiation is calculated as a quotient from the sum of all clusters’ in-degrees and out-degrees.
- Network size is defined as the sum of all references yielding at least one co-citation under the particular method of calculation.
- Penetration of the complete network is expressed as a percentage and calculated as the quotient of the sum of all references in the data set and the size of the network resulting from the calculation method.
- Network differentiation indicates the effectiveness of the method of calculation in representing differences between clusters.
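Assuming a co-citation network stored as a dict from reference pairs to counts, the cluster-level measures above might be computed like this (a sketch; all names are mine):

```python
def in_degree(cocit, cluster):
    # Sum of co-citations between references inside the cluster.
    return sum(w for (a, b), w in cocit.items() if a in cluster and b in cluster)

def out_degree(cocit, cluster):
    # Sum of co-citations crossing the cluster boundary.
    return sum(w for (a, b), w in cocit.items() if (a in cluster) != (b in cluster))

def density(cocit, cluster):
    # Share of realized internal links out of the n*(n-1)/2 possible ones.
    n = len(cluster)
    links = sum(1 for (a, b), w in cocit.items()
                if w > 0 and a in cluster and b in cluster)
    return links / (n * (n - 1) / 2)

def centralization(cocit, cluster):
    # Co-citation sum of the most-connected reference over the mean sum.
    sums = dict.fromkeys(cluster, 0)
    for (a, b), w in cocit.items():
        if a in cluster and b in cluster:
            sums[a] += w
            sums[b] += w
    return max(sums.values()) / (sum(sums.values()) / len(cluster))

# A star around X: high centralization, moderate density.
star = {("X", "A"): 3, ("X", "B"): 3, ("X", "C"): 3}
cluster = {"X", "A", "B", "C"}
print(density(star, cluster))         # 0.5
print(centralization(star, cluster))  # 2.0
```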
Method 1: Maximum co-citation counts
Cluster formation on the basis of the maximum absolute co-citation counts. The highest 200 counts were selected from the co-citation matrix.
This points to the fact that this method of network formation, in a discipline with a large citation count variance and a low level of polarization in terms of content, is most suitable for the representation of a school’s mainstream research. Often-cited references without frequent cross-references and their subsequent low co-citation counts are thereby discarded, which represents an advantage over straightforward ordering by citation proportions. However, differentiation between clusters is quite low, due to the close connections of citation- and co-citation counts for the most-cited references.
Method 2: Co-citation counts relative to citation mean
To a large extent, this method corresponds to the Jaccard coefficient used by SMALL & GREENLEE (1980), which places the co-citation count in relation to the sum of both partners’ individual citations, less the co-citation count.
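As a formula, this variant of the Jaccard coefficient can be written directly:

```python
def jaccard(c_ab, cit_a, cit_b):
    # Co-citations relative to the sum of both partners' citations, less the
    # co-citation count (so the score is 1 when the two are always cited together).
    return c_ab / (cit_a + cit_b - c_ab)

print(jaccard(8, 20, 10))   # ≈ 0.36
print(jaccard(10, 10, 10))  # 1.0
```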
This method highlights symmetrical co-citation relations, i.e., co-citations between documents that are cited with similar frequency, as opposed to relations between documents cited with differing frequency. This is manifest in the fact that core documents in the large clusters show far fewer relations than in method 1, where they occupy a central position. In this respect, the method is sensitive to the citation count variance of the data set: the greater the variance, the greater the risk that important co-citations between references cited with differing frequency are overlooked. Aside from the clusters, a series of bridging documents emerge as linking components between individual clusters. Under this method these documents remain outside of the study, because including them would diminish cluster density once the out-degree is taken into account.
Method 3: Co-citation counts relative to citation minimum
Using this method, the co-citation count is set in relation to the citation frequency of the smaller of the two co-citation partners. ... This method lends unspecific citations of the most-cited references lesser weight, whereas close co-citation relations between lesser-cited references are highlighted.
However, here the asymmetrical relations between references that are cited most frequently and those cited relatively infrequently are highlighted. Correspondingly, more probable star-shaped clusters are generated around a small number of central references.
Method 4: CoCit-Score
It demonstrates a considerably higher degree of robustness than the previous three methods because it reduces the influence of the citation relation between two co-cited references. The two calculation methods, based on the citation minimum and the citation mean respectively, are linked by a simple multiplication. The formula for calculating the CoCit-Score, which takes a value between 0 and 1, sets the co-citation count in relation to the minimum and mean counts of the two individual citations: CoCit = cocit(a, b)² / (min(cit(a), cit(b)) · mean(cit(a), cit(b))).
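A sketch of the score, assuming the standard statement of Gmür's formula, CoCit = c² / (min · mean):

```python
def cocit_score(c_ab, cit_a, cit_b):
    # CoCit = c_ab^2 / (min(cit_a, cit_b) * mean(cit_a, cit_b)).
    # Since c_ab <= min <= mean, the score stays between 0 and 1.
    return c_ab ** 2 / (min(cit_a, cit_b) * (cit_a + cit_b) / 2)

print(cocit_score(10, 10, 10))  # 1.0: always cited together
print(cocit_score(5, 10, 20))   # ≈ 0.17: asymmetric pair, partial overlap
```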
Method 5: Pearson’s correlation coefficient
Therefore, the maximum correlation coefficient of 1 is achieved if two references with identical individual citation counts are always cited in the same article. Where there are two references with differing frequency of citation, the count decreases even when the less commonly cited reference is always accompanied by the more frequently cited reference. The coefficient reaches the theoretical minimum value of -1 if exactly one of two references is cited in each of the evaluated articles. ... It is seen that the co-citation network content matches the network produced on the basis of the citation mean.
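These properties can be checked directly on small 0/1 citation vectors (one entry per evaluated article):

```python
import math

def pearson(u, v):
    # Pearson's r between two citation vectors of equal length.
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Identical citation patterns: r = 1.
print(pearson([1, 1, 0, 0], [1, 1, 0, 0]))  # 1.0
# The rarer reference always accompanies the commoner one, yet r < 1.
print(pearson([1, 1, 1, 0], [1, 1, 0, 0]))  # ≈ 0.58
# Exactly one of the two is cited in every article: r = -1.
print(pearson([1, 0, 1, 0], [0, 1, 0, 1]))  # -1.0
```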
Method 6: Factor loadings
Factor analysis presents itself as an alternative to the counting and weighting of cumulated co-citations, with which clusters may be formed from factors and factor loadings.
This supports the conclusion that, given the high level of structural complexity of the data set, factor analysis does not generate a true representation of the co-citation relations it is based on.
In this instance, factor analysis is found to be unsuitable for the differentiation of subfields of research, as it yields low differentiation and clusters that are difficult to interpret.
Cluster formation using the strongest co-citations is the method chosen mainly in early studies up to the mid-1980s. However, it yields an insufficient level of delimitation in a field such as organization science, which is characterized by a low paradigmatical level of differentiation and a blurring of boundaries between individual research subfields.
Cluster formation on the basis of mean citation count or similar measurements, which present the co-citation in relation to the citation frequencies of the respective co-citation partner, may lead to sufficient differentiation. However, this method is susceptible to asymmetries between individual citations, so that the position of central, influential publications in an area of research is not explicit. The resultant mean centralization count for the clusters is low.
Cluster formation on the basis of the minimal individual citation count leads to centralized clusters around the most-frequently cited references, whereas the cross-links between a school’s secondary references receive only low weighting. In individual cases, the resulting density makes delineation between subfields of research more difficult, which again results in a lower degree of differentiation.
The CoCit-Score, which links two criteria for measuring the relative significance of a co-citation, gives similar weighting to both symmetrical and asymmetrical co-citation pairings. Compared to related methods, cluster size is large, so that it can be assumed that the largest clusters were assigned all relatively close references, without jeopardizing differentiation between the clusters.
An equally valid alternative is to interpret the correlation coefficient of a pairing as a coefficient of similarity. Here, selectivity is maximized as much as differentiation in the overall network.
On the macro-level of the complete network, the influence of not using co-citations as absolute values for cluster formation becomes clear. Here, the opportunities for differentiation are quite low. All other methods lead to a similar number of clusters.
On the meso-level of the clusters, the various methods already reveal clear differences. ... Schools of research close in terms of content are then placed closer still, or further away from each other, depending on the method used. However, with the exception of clustering from factor analysis, with its multiple assignations, no fundamentally new groupings emerge that do not equally demonstrate commonality with the other cluster methods.
The differences are greatest on the micro-level of individual relations. ...  The selection methods on the basis of mean citation count, correlation coefficient and the CoCit-Score demonstrate a higher degree of similarity between themselves, whilst the latter shows on average the highest degree of similarities with the other methods.
- Which publications exert most influence on the discipline?
Co-citation analysis on the basis of maximum co-citation counts provides an alternative to a simple citation count. The result consists of an extendable cluster of frequently cited references closely related to other documents.
- Which communities and areas of research does the discipline encompass?
If the aim is to identify as many dense and selective clusters as possible, then the correlation coefficient is the best criterion for defining the co-citational similarity of two documents. A similarly positive result can be expected if the co-citations are set in relation to the individual citations, with greater weight conferred on the smaller of the two references (CoCit-Score).
- Which documents define the discipline’s communities or areas of research?
The method of minimal individual citation generates star-shaped and thereby sufficiently differentiated clusters around the respective dominant documents. This method is particularly well suited when combined with cluster formation using correlation coefficients or CoCit-Scores.

Leydesdorff, L. & Rafols, I. (2011). Indicators of the interdisciplinarity of journals: Diversity, centrality, and citations. Journal of Informetrics, 5, 87-100.


scientometrics

Van den Besselaar & Leydesdorff (1996) regard interdisciplinarity as a transient phenomenon: when a new specialty emerges it may draw heavily on its mother disciplines/specialties, but once it matures, its new journals increasingly cite one another and form a closed loop, which is the typical mark of a discipline. Interdisciplinarity, however, can also mean something different for journals at the top of the journal hierarchy (such as Science and Nature) than at the bottom, where journals must draw on different bodies of knowledge for the sake of their applications (e.g., in engineering).

This study examines six ways of measuring the interdisciplinarity of journals.

Among these measures, Shannon's entropy and the Gini coefficient (Buchan, 2002) are computed, for a given journal, from the counts of its citations to other journals or of the citations it receives from other journals. Shannon's entropy is H = −Σ p_i log(p_i), where p_i is the probability of the i-th element of the distribution; entropy measures the uncertainty of a distribution. The Gini coefficient is computed from x_i, the count of the i-th element, and measures the inequality or unevenness of a distribution.

Besides Shannon entropy and the Gini coefficient, one can also build journal-level networks from inter-journal co-citation (cited) or co-citing (citing) counts and estimate interdisciplinarity with betweenness centrality (Leydesdorff, 2007); this study does so in two ways, directly on the raw counts and on the cosine-normalized matrix.

Finally, Stirling (2007) proposed an integrated measure of diversity in which d_ij is the distance between two journals; in this study d_ij is estimated both as the Euclidean distance and as (1 − cosine). Stirling's (2007) measure was applied by Porter & Rafols (2009) and Rafols & Meyer (2010) to interdisciplinarity at the article level; this study applies it at the journal level.

The six measures are compared using Spearman's rank-order correlations. The results show that the rankings based on the cited and the citing directions are only weakly correlated. The authors suggest that a journal drawing on a diverse knowledge base (citing) does not necessarily have a diverse readership (cited); in addition, the research front (citing) varies more, while the knowledge base (cited) is more stable.

Factor analysis of the indicators shows that Shannon entropy, the Gini coefficient, and Rao–Stirling diversity based on (1 − cosine) load on a common factor, which can be labeled "interdisciplinarity"; betweenness centrality is related to the size of the network; and Rao–Stirling diversity based on Euclidean distances is related to the citation impact of the journal.

In this study, we investigate network indicators (betweenness centrality), unevenness indicators (Shannon entropy, the Gini coefficient), and more recently proposed Rao–Stirling measures for “interdisciplinarity.”

Among the various journal indicators based on citations, such as impact factors, the immediacy index, cited half-life, etc., a specific indicator of interdisciplinarity has hitherto been lacking (Kajikawa & Mori, 2009; Porter, Roessner, Cohen, & Perreault, 2006; Porter, Cohen, David Roessner, & Perreault, 2007; Wagner et al., in press; Zitt, 2005).

Given the matrix of aggregated journal–journal citations as derived from the Journal Citation Reports (JCR) of the (Social) Science Citation Index, a clustering algorithm usually aims to partition the database in terms of similarities in the distributions.

Some journals reach across boundaries because they relate different subdisciplines into a single (disciplinary) framework. ... Other journals combine intellectual contributions based on methods or instruments used in different disciplines.

Furthermore, interdisciplinarity may be a transient phenomenon. As a new specialty emerges, it may draw heavily on its mother disciplines/specialties, but as it matures a set of potentially new journals can be expected to cite one another increasingly, and thus to develop a type of closure that is typical of “disciplinarity” (Van den Besselaar & Leydesdorff, 1996).

Interdisciplinarity, however, may mean something different at the top of the journal hierarchy (as in the case of Science and Nature) than at the bottom, where one has to draw on different bodies of knowledge for the sake of the application (e.g., in engineering).

Among the network indicators, betweenness centrality seems an obvious candidate for the measurement of interdisciplinarity (Freeman, 1977; Freeman, 1978/1979). One of us experimented with betweenness centrality as an indicator of interdisciplinarity in aggregated journal–journal citation networks (Leydesdorff, 2007). ... Using rotated factor analysis, Bollen et al. (2009b, pp. 4 ff.) found betweenness centrality positioned near the origin of a two-factor solution; this suggests that betweenness centrality might form a separate (third) dimension in their array of 39 possible journal indicators.

The occasion for returning to the research question of a journal indicator for “interdisciplinarity” was provided by the new interest in “interdisciplinarity” in bibliometrics (Laudel & Origgi, 2006; Wagner et al., in press) and the availability of another potential measure: diversity as defined by Stirling (2007; cf. Rao, 1982). Would it perhaps be possible to benchmark the various possible indicators of “interdisciplinarity” against each other? Using this new measure, Porter & Rafols (2009) and Rafols & Meyer (2010), for example, suggested that it would be useful to indicate interdisciplinarity at the article level.

Stirling (2007, p. 712) proposed to integrate the (in)equality in a vector with the network structure using the following formula for diversity D: D = Σ_(i≠j) p_i · p_j · d_ij.



This measure is also known in the literature as “quadratic entropy” (e.g., Izsáki & Papp, 1995) because unlike traditional measures of diversity such as the Shannon entropy and the Gini index, the probability distributions (pi and pj) of the units of analysis (in our case, the citation distributions of the individual journals) are multiplied by the distance in the (citation) network among them (dij).
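A direct transcription of this quadratic-entropy sum:

```python
def rao_stirling(p, d):
    # p: probability distribution over categories (or journals);
    # d: pairwise distance matrix between them.
    n = len(p)
    return sum(p[i] * p[j] * d[i][j]
               for i in range(n) for j in range(n) if i != j)

# Two equally used categories at maximal distance:
print(rao_stirling([0.5, 0.5], [[0, 1], [1, 0]]))  # 0.5
```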

Stirling (2007) proposed his approach as “a general framework for analyzing diversity in science, technology and society” because the two dimensions – (un)evenness in the distributions at the vector level and similarity among the vectors at the matrix level – are combined (Rafols & Meyer, 2010).

Data was harvested from the CD-Rom versions of the JCRs 2008 of the Science Citation Index (6598 journals) and the Social Sciences Citation Index (1980 journals). 371 of these journals are covered by both databases. Our set is therefore 6598 + 1980 − 371 = 8207 journals.

Let us first turn to the vector-based measures. These are based on the frequency distributions of citations of each of the journals, either in the cited or citing directions.

In the extreme case where a journal only cites or is cited by articles in the journal itself, the inequality in the citation distribution is maximal and the uncertainty minimal. Maximum inequality corresponds to a Gini of unity and minimum uncertainty is equal to a Shannon entropy of zero. The journal is then extremely mono-disciplinary.

The Gini coefficient is a well established measure of inequality or unevenness in a distribution: G = (2 Σ_i i · x_i) / (n Σ_i x_i) − (n + 1)/n,

with n being the number of elements in the population and xi being the number of citations of element i in the ranking. The Gini ranges between zero for a completely even distribution and (n–1)/n for a completely uneven distribution, approaching one for large populations.

For comparisons among smaller populations of varying size, this requires a normalization that brings the Gini coefficients of all populations to the same maximum of one.
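One common rank-based form of the Gini coefficient, together with this normalization, can be sketched as follows (the paper's own notation may differ):

```python
def gini(x):
    # Rank-based Gini over citation counts x: 0 for a completely even
    # distribution, (n - 1) / n when all citations fall on a single element.
    x = sorted(x)
    n = len(x)
    return 2 * sum((i + 1) * v for i, v in enumerate(x)) / (n * sum(x)) - (n + 1) / n

def gini_normalized(x):
    # Rescale so the maximum is exactly one for any population size.
    n = len(x)
    return gini(x) * n / (n - 1)

print(gini([1, 1, 1, 1]))             # 0.0
print(gini([0, 0, 0, 1]))             # 0.75, i.e. (n - 1) / n for n = 4
print(gini_normalized([0, 0, 0, 1]))  # 1.0
```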


The uncertainty contained in a distribution can be formalized using Shannon’s (1948) formula for probabilistic entropy: H = −Σ_i p_i · log2(p_i).
The maximum information is the same and thus a constant for all vectors, namely log2(8207) = 13.00 bits.
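A minimal sketch of the entropy computation and its maximum:

```python
import math

def shannon_entropy(counts):
    # H = -sum(p_i * log2(p_i)) over the citation distribution.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# A journal cited evenly by four journals: H = log2(4) = 2 bits.
print(shannon_entropy([10, 10, 10, 10]))  # 2.0
# The global maximum for 8207 journals:
print(math.log2(8207))                    # ≈ 13.00 bits
```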

Betweenness centrality is defined as follows: C_B(k) = Σ_i Σ_j (g_ijk / g_ij), for i ≠ j ≠ k.

Or, in words: the betweenness centrality of a vertex k is equal to the proportion of all geodesics between pairs (gij) of vertices that include this vertex (gijk).
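For small networks the definition can be implemented literally by enumerating geodesics; the brute-force sketch below is illustrative only (real studies would use Brandes' algorithm or a network library):

```python
from collections import deque
from itertools import combinations

def shortest_paths(adj, s, t):
    # Enumerate all geodesics from s to t by breadth-first search.
    paths, best = [], None
    queue = deque([[s]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            continue  # longer than a known geodesic: prune
        if path[-1] == t:
            best = len(path)
            paths.append(path)
            continue
        for nb in adj[path[-1]]:
            if nb not in path:
                queue.append(path + [nb])
    return paths

def betweenness(adj, k):
    # Fraction of geodesics between every other pair that pass through k.
    total = 0.0
    for s, t in combinations(adj, 2):
        if k in (s, t):
            continue
        paths = shortest_paths(adj, s, t)
        if not paths:
            continue  # disconnected pair
        total += sum(k in p for p in paths) / len(paths)
    return total

chain = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(betweenness(chain, "B"))  # 1.0: B lies on the only A-C geodesic
print(betweenness(chain, "A"))  # 0.0
```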


We shall thus compare betweenness centrality as an indicator of interdisciplinarity by using both the asymmetrical citation matrix (in both the cited and citing directions) and the two symmetrical co-citation matrices (that is, using the numerators of Eq. (7) for distinguishing between zeros and ones).


Euclidean distances seem a most natural candidate for the distance matrix used for measuring Rao–Stirling diversity (Eq. (1)). First, Euclidean distances involve the least restrictive assumptions; second, Euclidean distances can be transformed through simple scaling of dimensions to represent a wide range of possible geometries (Kruskal, 1964); and third, Euclidean distances are more familiar, parsimonious, and intuitively accessible than most other distance measures.
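The practical difference between the two distance measures is easy to see on two citation vectors with the same profile but different sizes, i.e. a small and a large journal with identical citing patterns:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_minus_cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1 - dot / norm

small = [1.0, 2.0]
large = [2.0, 4.0]  # same profile, twice the volume
print(one_minus_cosine(small, large))  # ≈ 0: cosine ignores journal size
print(euclidean(small, large))         # ≈ 2.24: Euclidean distance does not
```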


Throughout the paper, we use Spearman’s rank-order correlations because our primary objective is an indication of interdisciplinarity as a variable attribute among journals.


While the Gini coefficient indicates unevenness, Shannon entropy provides an indicator of evenness. In other words, the Gini coefficient can be considered as an indicator of specificity and therefore disciplinarity, whereas the entropy (H) increases both when more cells of the vector are affected and with greater spread among the different categories.


The negative signs of the rank-order correlations between the two indicators (Table 2) show the opposite directionality.





Not surprisingly, there is no strong correlation between rankings in the cited and citing dimensions: journals that build on diverse knowledge bases (citing patterns) do not necessarily have diverse audiences (cited patterns).


Table 2 shows that correlations between the two indicators in the cited dimension (ρ = –0.803) are higher than in the citing dimension (ρ = –0.658). This is understandable, since the citing side represents the research front and therefore introduces variability, while the archive of science is cited and thus can be expected to be more stable (Leydesdorff, 1993).


As expected, the entropy measure is affected by size. ... The Gini coefficient corrects for this size effect because of a normalization in the denominator.


The results based on using (1−cosine) as a distance measure can be provided with an interpretation, but an interpretation is more difficult to provide for results based on Euclidean distances.


In summary, these results first suggest that the (1−cosine)-based measure operates on average better as an indicator of interdisciplinarity than the one based on Euclidean distances.


Shannon entropy measures variety at the vector level and can be thus used as an indicator of interdisciplinarity if one is not primarily interested in a correction for size effects. Betweenness centrality in the cosine-normalized matrix provides a measure for interdisciplinarity. Using cosine values as weights for the edges can be expected to improve this measure further. Rao–Stirling diversity measures are sensitive to the distance measure being used.


Factor analysis enables us to study whether the various indicators cover the same ground or should be considered as different.

Leydesdorff (2009) found two main dimensions – namely, size and impact – in the cited direction when using the ISI set of journals and including network indicators.

On the one hand, the impact factor and the immediacy index are highly correlated (Yue, Wilson, & Rousseau, 2004); on the other, total cites and indegree can be considered as indicators of size (Bensman, 2007; Bollen et al., 2009a, 2009b).


Using these four indicators to anchor the two main dimensions in the cited dimension and the six indicators discussed above, Table 9 shows that in a three-factor model – three factors explain 72.4% of the variance in this case – the first factor can indeed be associated with “size” and the third with “impact.”

Entropy, the Gini coefficient, and Rao–Stirling diversity based on (1−cosine) as a distance measure constitute another (second) dimension which one could designate as “interdisciplinarity.”

Betweenness centrality, however, loads highest on the size factor even after normalization for size.

Rao–Stirling diversity based on relative Euclidean distances loads negatively on the third factor (“impact”), and is in this respect different from all the other indicators under study.


The factor structures in Table 10(a and b) – cited and citing, respectively – are considerably different. These results suggest that the underlying structure is more determined by the functionality in the data matrix (cited or citing) than by correlations among the indicators.

In both solutions, however, betweenness before and after normalization load together on a second factor. This is not surprising since the two measures are related (Bollen et al., 2009b).
In both solutions, we also find Rao–Stirling diversity measured on the basis of (1−cosine) as the distance measure and Shannon entropy loading on the same factor.
The Gini coefficient and the Rao–Stirling diversity based on Euclidean distances have a different (i.e., not consistent) position in the cited or the citing directions.


In summary, Shannon entropy qualifies as a vector-based measure of interdisciplinarity.

Our assumption that the Gini coefficient would qualify as an indicator of inequality and therefore (disciplinary) specificity was erroneous: interdisciplinarity is not just the opposite of disciplinarity.
Betweenness centrality and Rao–Stirling diversity (after cosine-normalizations) indicate different aspects of interdisciplinarity. Betweenness centrality, however, remains associated with size more than Rao–Stirling diversity or entropy despite the normalization. Perhaps setting a threshold would change this dependency on size because larger journals can be expected to be cited in or citing from a larger set.


In order to enhance the interpretation by the readership of this journal, we chose the category of Library and Information Science, which contained 61 journals in 2008. In other words, we compare these 61 journals in terms of how they are cited by the 8207 journals in the database. (Note that one can also compute local values for betweenness centrality, etc., using the 61×61 citation matrix among these journals.)


The two ways to measure betweenness centrality and Rao–Stirling diversity, respectively, provide the first two factors. Entropy loads primarily on Factor One with Rao–Stirling diversity, and to a lower extent on Factor Two with betweenness centrality.


Table 12, finally, shows the top 20 journals ranked on their betweenness centrality after normalization as one of the possible indicators for interdisciplinarity. Entropy correlates at the level of ρ = 0.830 with this indicator, and ρ = 0.732 with Rao–Stirling diversity based on (1−cosine) as the distance measure. The latter measure has the advantage of correlating less with size (for example, total cites) than the other two: the ρ with total cites (in 2008) was 0.549 for Rao–Stirling diversity, 0.880 for betweenness centrality, and 0.793 for Shannon entropy.


Among the vector-based indicators, the Shannon entropy takes into account both the reach of a journal in terms of its degree – because this number (n≤N; N= 8207) limits the maximal uncertainty within the subset – and the spread in the pattern of citations among these n journals. By normalizing the entropy as a percentage of this local maximum (log(n)), one can correct for the size effect. But this brings to the top of the ranking specialist journals that are cited equally across a relatively small set.


Betweenness centrality based on cosine-normalized matrices qualifies as an indicator of interdisciplinarity.


One conceptual advantage of the Rao–Stirling diversity measure over betweenness centrality as used in this study is that the values are not binarized during the computation of diversity. An algorithm that would weigh the cosine values as a basis for the computation of betweenness centrality would perhaps improve our capacity to indicate interdisciplinarity (Brandes, 2001).

Rafols, I. & Meyer, M. (2010). Diversity and network coherence as indicators of interdisciplinarity: case studies in bionanoscience. Scientometrics, 82, 263-287.



scientometrics

From the perspective of knowledge integration, this study proposes a framework for analyzing interdisciplinarity built on two concepts: the diversity of the disciplines a publication's literature draws on, and the network coherence formed by similarity links among its references.

A publication's disciplinary diversity is expressed as the distribution of its references (or, if these are too few, the references of its references) over the ISI Subject Categories of the journals they appear in, covering the following aspects: 1) variety: the more Subject Categories the references span, the more likely the publication is interdisciplinary; 2) balance: the more even the distribution over the categories, the higher the diversity; 3) disparity (or similarity) among the categories covered: the more different those categories are, the higher the diversity. Based on this analysis, the study uses the following indicators: 1) the number of categories covered (N); 2) the entropy H = −Σ p_i log(p_i) computed from the probabilities over categories; 3) the Simpson diversity I = Σ_(i≠j) p_i p_j; and 4) the Stirling index Δ = Σ_(i≠j) p_i p_j d_ij, where d_ij is the disparity between categories and s_ij = 1 − d_ij is defined as the cosine similarity of the categories' citing patterns. The relations among the Subject Categories are also mapped as a network, with node size determined by how many of the publication's references fall in each category, so that diversity can be inspected visually.

Network coherence is assessed by structural analysis of the network built from the bibliographic-coupling relations among the publication's references, using 1) the mean linkage strength S and 2) the mean path length L. The mean linkage strength is the average link strength between any two nodes, i.e., the average bibliographic-coupling strength among all references; the mean path length is the average shortest-path length between any two nodes.

The results show that the diversity indicators N, H, I, and Δ correlate only weakly with the coherence indicators S and L, suggesting that the two concepts capture different aspects of interdisciplinarity. In fact, N is not clearly related even to the other three diversity indicators, and its results are harder to interpret. H, I, and Δ are highly correlated with one another, especially H and Δ, so any one of the three can be used to measure a publication's disciplinary diversity, and publications with high diversity are also easy to spot in the visualizations. Finally, S and 1/L are likewise highly correlated, so either can be used for coherence.

Summing up, the study suggests measuring the interdisciplinarity of publications with the Stirling index Δ and the mean linkage strength S. According to whether these are high or low, publications fall into four cases: 1) low diversity, high coherence: specialized disciplinary research whose references mostly belong to one discipline; 2) low diversity, low coherence: references belong to one discipline but to several different specialties; 3) high diversity, low coherence: references belong to several disciplines that have not yet been integrated; 4) high diversity, high coherence: references originally from several disciplines that have been fully integrated.
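The indicators in this framework can be sketched as follows; the category distribution p, the disparity matrix d, and the coupling strengths are toy inputs, and the function names are mine:

```python
import math

def variety(p):
    # N: number of Subject Categories actually used.
    return sum(1 for x in p if x > 0)

def entropy(p):
    # H = -sum(p_i * log2(p_i)).
    return -sum(x * math.log2(x) for x in p if x > 0)

def simpson(p):
    # I = sum over i != j of p_i * p_j = 1 - sum(p_i^2).
    return 1 - sum(x * x for x in p)

def stirling(p, d):
    # Delta = sum over i != j of p_i * p_j * d_ij, with d_ij = 1 - cosine similarity.
    n = len(p)
    return sum(p[i] * p[j] * d[i][j] for i in range(n) for j in range(n) if i != j)

def mean_linkage_strength(couplings):
    # S: average bibliographic-coupling strength over all reference pairs
    # (zeros included for uncoupled pairs).
    return sum(couplings) / len(couplings)

p = [0.5, 0.3, 0.2]
d = [[0.0, 0.2, 0.8],
     [0.2, 0.0, 0.9],
     [0.8, 0.9, 0.0]]
print(variety(p))              # 3
print(round(simpson(p), 2))    # 0.62
print(round(stirling(p, d), 3))  # 0.328
```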

We propose a conceptual framework that aims to capture interdisciplinarity in the wider sense of knowledge integration, by exploring the concepts of diversity and coherence.
Disciplinary diversity indicators are developed to describe the heterogeneity of a bibliometric set viewed from predefined categories, i.e. using a top-down approach that locates the set on the global map of science.
Network coherence indicators are constructed to measure the intensity of similarity relations within a bibliometric set, i.e. using a bottom-up approach, which reveals the structural consistency of the publications network.
We carry out case studies on individual articles in bionanoscience to illustrate how these two perspectives identify different aspects of interdisciplinarity: disciplinary diversity indicates the large-scale breadth of the knowledge base of a publication; network coherence reflects the novelty of its knowledge integration.
Review of bibliometric studies on interdisciplinarity
Most investigations use a top-down approach and predefined categories (typically ISI Subject Categories—SCs) to study their proportions and/or relations. For example, van Raan and van Leeuwen (2002) describe interdisciplinarity in an institute in terms of the percentage of publications and citations received to and from each SC.
Some investigations adopt a bottom-up approach, in which the low-level elements investigated (e.g. publications, papers) are clustered or classified into factors on the basis of multivariate analyses of similarity measures (Small 1973; Braam et al. 1991; van den Besselaar and Leydesdorff 1994; Schmidt et al. 2006). These clusters are then projected in 2D or 3D maps to provide an insight into the structure of the field and estimate the degree of network-level similarity. Similarity measures have also been used to compute network properties, such as centralities, to identify interdisciplinarity (Otte and Rousseau 2002; Leydesdorff 2007).
In this framework, we view knowledge integration as a dynamical process characterised by high cognitive heterogeneity (diversity) and an increase in relational structure (coherence); in other words, as a process in which previously different and disconnected bodies of research become related.
Diversity: concept and measures
The concept of diversity is used in many scientific fields, from ecology to economics and cultural studies, to refer to three different attributes of a system comprising different categories (Stirling 1998, 2007; Purvis and Hector 2000):
• variety: number of distinctive categories;
• balance: evenness of the distribution of categories;
• disparity or similarity: degree to which the categories are different/similar.
Our interest in using Stirling’s framework to track interdisciplinarity is twofold.
First, since Stirling’s generalised formulation needs a metric (dij) and has open values for the parameters a and b, it highlights that the mathematical form of any diversity index includes some prejudgement of the aspect of diversity that is considered important. High values for b give more weight to the contribution of large categories, and high values for a see the cooccurrence of distant categories as more important. The choice of the metric used to define distance is inevitably value laden.
Second, and very importantly for emerging fields, the inclusion of distance among categories lessens the effect of inappropriate categorisation changes: if a new category i is very similar to an existing category j, their distance dij will be close to zero, and its inclusion in the category list will result in only slightly increased diversity.
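The diversity indices discussed above (variety N, Shannon H, Simpson I, and the Stirling index Δ) can be sketched in Python as follows; the category proportions `p` and the distance matrix `d` are invented purely for illustration:

```python
import math

def shannon_h(p):
    """Shannon entropy H = -sum_i p_i * log(p_i), over nonzero proportions."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def simpson_i(p):
    """Simpson diversity I = sum_{i != j} p_i * p_j = 1 - sum_i p_i^2."""
    return 1.0 - sum(pi * pi for pi in p)

def stirling_delta(p, d):
    """Stirling diversity Delta = sum_{i != j} p_i * p_j * d_ij,
    where d_ij = 1 - s_ij is the distance between categories i and j."""
    n = len(p)
    return sum(p[i] * p[j] * d[i][j]
               for i in range(n) for j in range(n) if i != j)

# Hypothetical proportions of references over three Subject Categories
p = [0.5, 0.3, 0.2]
# Hypothetical pairwise category distances (zero on the diagonal)
d = [[0.0, 0.2, 0.9],
     [0.2, 0.0, 0.8],
     [0.9, 0.8, 0.0]]

print(len(p))                          # variety N = 3
print(round(shannon_h(p), 3))          # 1.03
print(round(simpson_i(p), 3))          # 0.62
print(round(stirling_delta(p, d), 3))  # 0.336
```

Note how Δ rewards co-occurrence of distant categories: the pair of proportions (0.5, 0.2) contributes more than (0.5, 0.3) because its distance 0.9 is larger.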
Coherence: concept and measures
In our bibliometric context, coherence expresses the extent to which publication networks form a more or less compact structure. If we take degree of cognitive similarity as the linkage between publications (e.g. by using co-citation, co-word or bibliographic coupling), a more clustered network is seen as having higher cognitive coherence.
However, since the key aspect of interdisciplinary research has been argued to be the dynamical process of knowledge integration (section ‘‘Definition of interdisciplinarity’’), interdisciplinarity should ideally be assessed in terms of a temporal derivative, i.e. a change in coherence.
High coherence within the reference set in a publication means that its referencing practices are highly specialised and hence, that it builds on an already established research specialty.
(i) Low diversity—High coherence is a case of specialised disciplinary research—all the references are from the same discipline and are related.
(ii) Low diversity—Low coherence is a case of a publication relating distant research specialties within one discipline.
(iii) High diversity—Low coherence is a case of a publication citing references that were hitherto unrelated and belong to different disciplines: a potential instance of interdisciplinary knowledge integration.
(iv) High diversity—High coherence is a case of a publication citing across several disciplines, to references that are similar. This similarity suggests that the references belong to a single research specialty. Hence, although the publication is interdisciplinary, it does not involve new knowledge integration.
Operationalisation of disciplinary diversity
The disciplinary diversity of an article was constructed from the distribution of ISI SCs in the references of references (ref-of-refs in Fig. 3, and hereafter) of an article. To compute this distribution, we constructed a frequency list of the journals in which the ref-of-refs were published, and converted it into a frequency list of ISI SCs using the SC attribution of each journal as given in the Journal Citation Reports.
In order to compute the Stirling D diversity, a similarity matrix sij for the SCs must be constructed. To do so, we created a matrix of citation flows between SCs, and then converted it into a Salton’s cosine similarity matrix in the citing dimension. The sij describes the similarity in the citing patterns for each pair of SCs in 2006, for the SCI set (175 SCs).
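The conversion of a citation-flow matrix into a Salton cosine similarity matrix in the citing dimension can be sketched as follows: each SC is represented by its row of citations given to every SC, and sij is the cosine of the two row vectors. The small flow matrix among three hypothetical SCs is made up:

```python
import math

def citing_cosine(flows):
    """Salton cosine similarity between the rows of a citation-flow matrix:
    row i holds the citations that category i gives to each category j."""
    n = len(flows)
    norms = [math.sqrt(sum(x * x for x in row)) for row in flows]
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(flows[i], flows[j]))
            s[i][j] = dot / (norms[i] * norms[j]) if norms[i] and norms[j] else 0.0
    return s

# Hypothetical citation flows among three Subject Categories
flows = [[100, 40, 5],
         [40, 80, 10],
         [5, 10, 60]]
s = citing_cosine(flows)
print(round(s[0][1], 3))  # ~0.747: the first two SCs cite in similar patterns
```

The resulting matrix is symmetric with ones on the diagonal, as expected for a cosine similarity.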
Operationalisation of network coherence
In order to operationalise network coherence for our bibliometric set, we chose, first, a similarity metric between network elements (articles) to measure the strength of their linkages and, second, an indicator of the structural coherence of the network. Since the aim was to map the breadth of knowledge sources, similarity was measured in terms of bibliographic couplings between articles (co-occurrences of references), and normalised using Salton’s cosine (Ahlgren et al. 2003). Then, basic network measures were used as indicators for network coherence:
• Mean linkage strength, S: the mean of the bibliographic coupling matrix, excluding the diagonal—equivalent to network density in binary networks. In valued networks, it describes both realised links and intensity of similarities. By definition, S has a value between zero and 1.
• Mean path length, L: the path length between two articles is defined as the minimum number of links crossed to go from one article to the other over the network. Mean path length describes how ‘spread’ the network is; it is computed after binarising similarities.
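The two coherence indicators above might be computed as in this sketch; the 4×4 cosine-normalised bibliographic-coupling matrix is invented for illustration:

```python
from collections import deque
from itertools import combinations

def mean_linkage_strength(sim):
    """Mean linkage strength S: mean of the off-diagonal similarity values."""
    n = len(sim)
    pairs = list(combinations(range(n), 2))
    return sum(sim[i][j] for i, j in pairs) / len(pairs)

def mean_path_length(sim, threshold=0.0):
    """Mean path length L: average shortest-path length over reachable pairs,
    after binarising (an edge exists where similarity exceeds the threshold)."""
    n = len(sim)
    adj = [[j for j in range(n) if j != i and sim[i][j] > threshold]
           for i in range(n)]
    total, count = 0, 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:                     # breadth-first search from src
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for v, dd in dist.items():
            if v > src:              # count each unordered pair once
                total += dd
                count += 1
    return total / count if count else float("inf")

# Hypothetical coupling similarities among 4 references (chain-like network)
sim = [[1.0, 0.5, 0.0, 0.0],
       [0.5, 1.0, 0.4, 0.0],
       [0.0, 0.4, 1.0, 0.3],
       [0.0, 0.0, 0.3, 1.0]]
print(round(mean_linkage_strength(sim), 3))  # S = 0.2
print(round(mean_path_length(sim), 3))       # L = 1.667
```

A denser, more coherent reference network would raise S and lower L, which is why S and 1/L turn out to be highly correlated in the study.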
Diversities H, I and D were found to be correlated.
Interestingly, the highest correlation was between Shannon H and Stirling D, although Stirling D and Simpson I (rather than Shannon) have similar mathematical formulations.
Since Shannon H gives more weight to the small terms in its sum through its logarithmic factor, while Stirling D gives more weight to the combinations of disparate SCs, we believe that the high correlation between H and D is due to the fact that many SCs with small proportions happen also to be distant from the core SCs.
Indicators of coherence, S and 1/L, were also highly correlated with one another, but not with the diversity measures.
Variety N was not correlated with any other measure, and it does not seem to be a good indicator of knowledge integration.
In this article, we proposed a novel conceptual framework to investigate interdisciplinary processes in the wider sense of knowledge integration. The framework is based on the concepts of diversity and coherence, ....
Diversity was used to capture the disciplinary heterogeneity of our bibliometric set as seen through the filter of predefined categories, i.e. taking a top-down perspective in order to locate the set on the global map of science (Fig. 6).
Coherence was used to apprehend the intensity of similarity relations within the bibliometric set, i.e. using a bottom-up approach to reveal the structural consistency and cognitive articulation of the publications network (Fig. 7).
Disciplinary diversity indicators were constructed from diversity indices (Shannon H and Simpson I) and a recently developed indicator (Stirling D, parameterised as Porter’s Integration), which takes account of the similarities between SCs (Stirling 1998, 2007; Porter et al. 2007). ISI SCs were used as disciplinary categories.
Network coherence was operationalised in terms of the network measures Mean linkage strength and mean path length, in bibliographic coupling networks (see Havemann et al. 2007 for a similar approach).
First, we found that the indicators for disciplinary diversity and network coherence were not correlated (Table 4), thus providing ‘orthogonal’ perspectives of the knowledge integration process.
Second, since there is a trade-off between accuracy and simplicity of a taxonomy, it is possible that the unit of analysis (the article) in this study is too small for the coarse-grained description of science provided by ISI SCs.
Third, we found that measures for network coherence could discriminate among articles according to their different degrees of knowledge integration at micro level. ... The operationalisation of network coherence in terms of mean linkage strength of bibliographic coupling appeared to work well, both for our small sets and in larger studies reported by Havemann et al. (2007). Moreover, it has the advantage of simplicity.
Fourth, the visualisations of diversity (through the overlay of disciplinary proportions on the map of science, Fig. 6), and of coherence (by means of the bibliographic coupling network, Fig. 7), proved more valuable than expected.

Cobo, M. J., López‐Herrera, A. G., Herrera‐Viedma, E., & Herrera, F. (2012). SciMAT: A new science mapping analysis software tool. Journal of the American Society for Information Science and Technology.


information visualization


This paper introduces the functions and applications of SciMAT, a science mapping analysis tool. Following Börner et al. (2003) and Cobo et al. (2011b), a science mapping analysis proceeds through the following steps: 1) data retrieval, 2) data preprocessing, 3) network extraction, 4) network normalization, 5) mapping, 6) analysis, and 7) visualization. Notably, data preprocessing—handling duplicates and errors in the raw data, time slicing, and reducing the data and network—is one of the steps most critical to obtaining good results in a science mapping analysis. Network extraction builds relations among the units of analysis from the bibliographic records, including co-occurrence, coupling, and direct linkage. A co-occurrence relation between two units depends on whether, and how often, they appear together in a set of documents; a coupling relation between two documents rests on whether they share units and on how many they share, while coupling between authors or journals is aggregated from the units shared by their documents; direct linkage is the citation relation between documents and their references. Different units of analysis combined with different relations support analyses of different facets of a research field: a co-authorship network extracted from authors appearing together in documents can be used to analyse the field's social structure; analysis of a co-word network built from term co-occurrence within documents reveals the conceptual structure and the main concepts the field deals with; and the co-citation and bibliographic coupling relations arising from references are used to analyse the field's intellectual structure. From this analysis of the workflow, a good science mapping tool should: a) include modules for every step of the science mapping workflow; b) provide a powerful de-duplication module; c) be able to build a large variety of large bibliometric networks; d) offer good visualization techniques; and e) enrich its output with bibliometric measures and indicators. The SciMAT tool proposed in this study satisfies all of these requirements. SciMAT comprises three main modules: the knowledge base, the configuration of the analysis workflow, and the visualization of measures and maps. The knowledge-base module lets the analyst import search results from various bibliographic sources, storing the authors, keywords, journals, references and other data of each document; with the functions it provides, the analyst can edit and preprocess the data to improve its quality and so obtain better analysis results. The workflow-configuration module sets, step by step, the time periods of the analysis, the unit of analysis and the relation, the frequency thresholds on the data, the similarity measure used for normalization, the clustering method, and the parameters of the network, performance, and temporal or longitudinal analyses. The visualization module provides, for each analysed period, detailed network maps, strategic diagrams and the associated bibliometric measures, and can also produce longitudinal charts showing how the clusters representing research themes evolve across periods.

The general workflow in a science mapping analysis has different steps (Börner et al., 2003; Cobo et al., 2011b) (see Figure 1): data retrieval, data preprocessing, network extraction, network normalization, mapping, analysis, and visualization. At the end of this process, the analyst has to interpret and obtain conclusions from the results.

Usually, the data retrieved from the bibliographic sources contain errors, so a preprocessing process must be applied first. In fact, the preprocessing step is one of the most important to obtain good results in science mapping analysis. Different preprocessing processes can be applied to the raw data, such as detecting duplicate and misspelled items, time slicing, data reduction, and network reduction (for more information, see Cobo et al., 2011b).

A co-occurrence relation is established between two units (authors, terms, or references) when they appear together in a set of documents; that is, when they co-occur throughout the corpus.

A coupling relation is established between two documents when they have a set of units (authors, terms, or references) in common. Furthermore, the coupling can be established using a higher level unit of aggregation, such as authors or journals. That is, a coupling between two authors or journals can be established by counting the units shared by their documents (using the author’s or journal’s oeuvres).
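As a minimal illustration of the coupling relation, the coupling strength between two documents can be counted as the size of the intersection of their unit sets; the document identifiers and reference lists here are invented:

```python
def coupling_strength(units_a, units_b):
    """Coupling between two documents = number of shared units
    (here shared references; shared authors or terms work the same way)."""
    return len(set(units_a) & set(units_b))

# Hypothetical documents, each represented by its reference list
doc1 = ["Small1973", "Kessler1963", "Callon1983"]
doc2 = ["Small1973", "Callon1983", "McCain1991"]
doc3 = ["White1981"]

print(coupling_strength(doc1, doc2))  # 2 shared references
print(coupling_strength(doc1, doc3))  # 0
```

Aggregating to a higher level (author or journal coupling) amounts to pooling the unit sets of each author's or journal's documents before intersecting.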

Finally, a direct linkage establishes a relation between documents and references, particularly a citation relation.

In addition, different aspects of a research field can be analyzed depending on the units of analysis used and the kind of relation selected (Cobo et al., 2011b).

For example, using the authors, a coauthor or coauthorship analysis can be performed to study the social structure of a scientific field (Glänzel, 2001; Peters & van Raan, 1991).

Using terms or words, a co-word (Callon, Courtial, Turner, & Bauin, 1983) analysis can be performed to show the conceptual structure and the main concepts dealt with by a field.

Cocitation (Small, 1973) and bibliographic coupling (Kessler, 1963) are used to analyze the intellectual structure of a scientific research field.

We therefore think it would be desirable to develop a science mapping software tool that satisfies the following requirements: (a) it should incorporate modules to carry out all the steps of the science mapping workflow, (b) it should present a powerful de-duplicating module, (c) it should be able to build a large variety of bibliometric networks, (d) it should be designed with good visualization techniques, and (e) it should enrich the output with bibliometric measures.

SciMAT generates a knowledge base from a set of scientific documents where the relations of the different entities related to each document (authors, keywords, journal, references, etc.) are stored. This structure helps the analyst to edit and preprocess the knowledge base to improve the quality of the data and, consequently, obtain better results in the science mapping analysis.

Taking into account the GUI, there are three important modules: (a) a module dedicated to the management of the knowledge base and its entities, (b) a module (wizard) responsible for configuring the science mapping analysis, and (c) a module to visualize the generated results and maps. These modules allow the analyst to carry out the different steps of the science mapping workflow.

Regarding its functionalities, the module to manage the knowledge base is responsible for building the knowledge base, importing the raw data from different bibliographical sources, and cleaning and fixing the possible errors in the entities. It can be considered as a first stage in the preprocessing step.

As shown, the workflow is divided into four main stages: (a) to build the data set, (b) to create and normalize the network, (c) to apply a cluster algorithm to get the map, and (d) to perform a set of analyses. These stages and their respective steps are described below:
1. Build the data set: At this stage, the user can configure the periods of time used in the analysis (select the periods), the aspects to be analyzed (select the unit of analysis: conceptual, using terms or words; social, using authors; or intellectual, using references), and the portion of the data to be used (filtering the data using a minimum frequency as a threshold).
2. Create and normalize the network: At this stage, the network is built using co-occurrence or coupling relations or, indeed, aggregated coupling. Then, the network is filtered to keep only the most representative items. Finally, a normalization process is performed using a similarity measure: association strength (Coulter et al., 1998; van Eck & Waltman, 2007), Equivalence Index (Callon et al., 1991), Inclusion Index, Jaccard Index (Peters & van Raan, 1993), or Salton’s cosine (Salton & McGill, 1983).
3. Apply a clustering algorithm to get the map and its associated clusters or subnetworks: At this stage, the clustering algorithm used to build the map has to be selected. Different clustering methods are available in SciMAT, such as the Simple Centers Algorithm (Cobo et al., 2011a; Coulter et al., 1998), Single-linkage (Small & Sweeney, 1985), and variants such as Complete-linkage, Average-linkage, and Sum-linkage.
4. Apply a set of analyses: The final step of the wizard consists of selecting the analyses to be performed on the generated map.
(a) Network analysis: By default, SciMAT adds Callon’s density and centrality (Callon et al., 1991; Cobo et al., 2011a) as network measures to each detected cluster in each selected period. Callon’s centrality measures the degree of interaction of a network with other networks, and it can be understood as the external cohesion of the network. ... Callon’s density measures the internal strength of the network, and it can be understood as the internal cohesion of the network. ... These measures are useful to categorize the detected clusters of a given period in a strategic diagram (Cobo et al., 2011a).
(b) Performance analysis: SciMAT is able to assess the output according to several performance and quality measures. To do that, it incorporates into each cluster a set of documents using a document mapper function and then calculates the performance based on quantitative and qualitative measures (using citation-based measures, number of documents, etc.).
(c) Temporal analysis or longitudinal analysis: This allows the user to discover the conceptual, social, or intellectual evolution of the field. SciMAT is able to build an evolution map to detect the evolution areas (Cobo et al., 2011a) and an overlapping items graph (Price & Gürsey, 1975; Small, 1977) across the periods analyzed. Furthermore, SciMAT allows the user to choose different measures to calculate the weight of the “evolution nexus” (Cobo et al., 2011a) between the items of two consecutive periods, such as association strength (Coulter et al., 1998; van Eck & Waltman, 2007), Equivalence Index (Callon et al., 1991), Inclusion Index, Jaccard’s Index (Peters & van Raan, 1993), and Salton’s cosine (Salton & McGill, 1983).
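The similarity measures named in steps 2 and 4(c) above are commonly defined in terms of the co-occurrence count c_ij and the item frequencies s_i and s_j; a sketch with illustrative counts (the numbers are made up):

```python
import math

def similarity_measures(cij, si, sj):
    """Commonly used normalisation measures for a pair of items with
    co-occurrence count cij and individual frequencies si and sj."""
    return {
        "association_strength": cij / (si * sj),
        "equivalence_index": cij ** 2 / (si * sj),
        "inclusion_index": cij / min(si, sj),
        "jaccard_index": cij / (si + sj - cij),
        "salton_cosine": cij / math.sqrt(si * sj),
    }

# Hypothetical counts: two keywords co-occur 6 times; they occur 10 and 20 times
m = similarity_measures(cij=6, si=10, sj=20)
print(round(m["equivalence_index"], 3))  # 0.18
print(round(m["salton_cosine"], 3))      # 0.424
print(round(m["jaccard_index"], 3))      # 0.25
```

All five measures correct raw co-occurrence counts for the overall frequency of the two items, which is why they outperform raw frequencies in practice.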

At the end of all the steps in the wizard, the map would be built using the selected configuration. Then, the results would be saved to a file, and the visualization module loaded. The visualization module has two views: Longitudinal and Period.
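Callon's centrality and density, used in the network-analysis step above, might be sketched roughly as follows. This follows the commonly cited formulation (centrality as 10 times the sum of external link strengths, density as 100 times the mean internal link strength per keyword); the scale factors and the keyword network are assumptions for illustration, not taken from this paper:

```python
def callon_centrality(links, cluster):
    """Callon centrality (external cohesion): 10 times the sum of link
    strengths between keywords inside the cluster and keywords outside it.
    `links` maps frozenset({a, b}) -> normalised link strength."""
    return 10 * sum(w for pair, w in links.items() if len(pair & cluster) == 1)

def callon_density(links, cluster):
    """Callon density (internal cohesion): 100 times the sum of internal
    link strengths divided by the number of keywords in the cluster."""
    internal = sum(w for pair, w in links.items() if pair <= cluster)
    return 100 * internal / len(cluster)

# Hypothetical normalised keyword network
links = {
    frozenset({"fuzzy", "control"}): 0.8,
    frozenset({"fuzzy", "t-norm"}): 0.5,
    frozenset({"control", "robot"}): 0.2,  # link to a keyword outside the theme
}
theme = {"fuzzy", "control", "t-norm"}
print(callon_centrality(links, theme))       # 2.0
print(round(callon_density(links, theme), 1))  # 43.3
```

Plotting each theme's centrality against its density yields the strategic diagram mentioned in the text.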

The Period view (see Figure 12) shows detailed information for each period, its strategic diagram, and for each cluster, the bibliometric measures, the network, and their associated nodes.

Finally, in the Longitudinal view the overlapping map and evolution map are shown. This view helps us to detect the evolution of the clusters throughout the different periods, and study the transient and new items of each period and the items shared by two consecutive periods.

Taking into account quantitative measures such as the number of documents associated with each theme (cluster), we can discover where the fuzzy community has been employing a great effort (e.g., H-INFINITY-CONTROL, FUZZY-CONTROL, T-NORM, etc.). Similarly, taking into account the qualitative measure, we could identify the themes with a greater impact; that is, the themes that have been highly cited.

Combining the units of analysis and the bibliographic relations among them, SciMAT can extract 20 kinds of bibliographic networks, including the common bibliographic networks used in the literature, such as coauthor (Glänzel, 2001; Peters & van Raan, 1991), bibliographic coupling (Kessler, 1963), journal bibliographic coupling (Small & Koenig, 1977), author bibliographic coupling (Zhao & Strotmann, 2008), cocitation (Small, 1973), journal cocitation (McCain, 1991), author cocitation (White & Griffith, 1981), and co-word (Callon et al., 1983).

Tuesday, March 19, 2013

Klavans, R., & Boyack, K. W. (2006). Identifying a better measure of relatedness for mapping science. Journal of the American Society for Information Science and Technology, 57(2), 251-263.


information visualization


This study proposes a framework for evaluating relatedness measures and uses it to assess six intercitation and four cocitation measures of relatedness, together with the results of feeding them into a visualization algorithm. The six intercitation measures are raw citation frequency, the cosine index, the Jaccard index, the Pearson correlation coefficient, the relatedness factor (RF) proposed for journal citations by Pudovkin & Fuseler (1995) and Pudovkin & Garfield (2002), and the K50 index introduced in this study, defined as the cosine index minus an expected cosine value; the four cocitation measures are raw cocitation frequency, the cosine index, the Pearson correlation coefficient, and K50. The proposed evaluation framework rests on a set of pre-classified objects and assesses each relatedness measure's accuracy, coverage, scalability, and robustness; when measuring relatedness between journals, for example, the ISI journal categories can serve as the basis for evaluating the relatedness of the objects. Accuracy is the ability to judge correctly whether objects are related, and divides into local and global accuracy: local accuracy is the tendency of an object's nearest objects to be correctly placed or ranked, i.e. whether objects in the same category show higher relatedness than objects in different categories, while global accuracy concerns relations such as the placement and ranking of the categories themselves. Coverage is the proportion of all correct classification results that are recovered by relatedness values at or above a given threshold. Scalability is whether the measure can be applied to very large data sets, which depends on its computational cost. Robustness is whether, after the relatedness values are fed into a visualization algorithm for dimension reduction, the relatedness among the projected points on the resulting map preserves the relations given by the original measure. These four criteria interact: higher coverage usually yields less accurate results; if more accurate results are wanted, computationally heavier measures cannot achieve good scalability; dimension reduction may also degrade accuracy; and, finally, intercitation data as input can use the most recent data and yield more accurate results, whereas cocitation data as input can cover sources outside the analysed journal set. Using the 2000 ISI SCIE (Science Citation Index Expanded) and SSCI (Social Science Citation Index) files—7,121 journals with over 16.24 million references between them—the study computed the ten relatedness measures above and applied VxOrd for the visualization. The results show that, across coverage levels, the intercitation cosine (IC-Cosine) and the cosine-based K50 (IC-K50) predict the categories more accurately than the other measures, and are more practical in application than the computationally expensive Pearson correlation. Raw frequencies, whether intercitation or cocitation, performed poorly in this category-based evaluation of accuracy. Moreover, the intercitation measures were mostly more accurate than the cocitation measures. Most strikingly, after VxOrd visualization, every measure achieved higher accuracy than its original raw form.

The authors propose a new framework for assessing the performance of relatedness measures and visualization algorithms that contains four factors: accuracy, coverage, scalability, and robustness.

This method was applied to 10 measures of journal–journal relatedness to determine the best measure. The 10 relatedness measures were then used as inputs to a visualization algorithm to create an additional 10 measures of journal–journal relatedness based on the distances between pairs of journals in two-dimensional space. This second step determines robustness (i.e., which measure remains best after dimension reduction).

Results show that, for low coverage (under 50%), the Pearson correlation is the most accurate raw relatedness measure. However, the best overall measure, both at high coverage, and after dimension reduction, is the cosine index or a modified cosine index. Results also showed that the visualization algorithm increased local accuracy for most measures.

The two main groups of measures are intercitation measures, or those based on one journal citing another, and cocitation measures, which are based on the number of times two journals are listed together in a set of reference lists.

Although raw frequency has been used for both journal citation (Boyack, Wylie, & Davidson, 2002) and journal cocitation analysis studies in the past (McCain, 1991), it is rarely used today.

For intercitation studies, normalized frequencies such as the cosine, Jaccard, Dice, or Ochiai indexes (Bassecoulard & Zitt, 1999) are very simple to calculate, and give much better results than raw frequencies (Gmur, 2003).

A new type of normalized frequency, specific to journals, has been proposed recently (Pudovkin & Fuseler, 1995; Pudovkin & Garfield, 2002). This new relatedness factor (RF), an intercitation measure, is unique in that it is designed to account for varying journal sizes, thus giving a more semantic or topic-oriented relatedness than other measures.

The Pearson correlation coefficient, known as Pearson’s r, is a commonly used measure for journal intercitation (Leydesdorff, 2004a, 2004b), journal cocitation (Ding, Chowdhury, & Foo, 2000; McCain, 1992, 1998; Morris & McCain, 1998; Tsay, Xu, & Wu, 2003), document cocitation (Chen, Cribbin, Macredie, & Morar, 2002; Gmur, 2003; Small, 1999; Small, Sweeney, & Greenlee, 1985), and author cocitation studies (cf. White, 2003; White & McCain, 1998).

Lists of relatedness measurements are rarely analyzed directly, but are used as input to an algorithm that reduces the dimensionality of the data, and arranges the tokens on a 2-D plane. The distance between any two tokens on the 2-D plane is thus a secondary (or reduced) measure of relatedness.

Validation of relatedness measures has received little attention over the years. Most of these efforts have been to compare 2-D maps obtained from MDS with some sort of expert perceptions of the subject field.

Only one study has compared citation-based relatedness measures. Gmur (2003) compared six different relatedness measures based on the cocitation counts of 194 highly cited documents in the field of organization science. The measures included raw frequency, three forms of normalized frequency, Pearson’s r, and loadings from factor analysis. The bases for comparison were network-related metrics such as cluster numbers, sizes, densities, and differentiation. Results were strongly influenced by similarity type. For optimum definition of the different areas of research within a field, and their relationships, clustering based on Pearson’s r or on the combination of two types of normalized frequency worked best.

Accuracy refers to the ability of a relatedness measure to identify correctly whether tokens (e.g., journals, documents, authors, or words) are related.

Local accuracy refers to the tendency of the nearest tokens to be correctly placed or ranked. Ideally, local accuracy is measured from the perspective of each individual token. For authors, the question might be whether an author would agree with the ranking of the 10 most closely related authors. For journals, the question might be whether the closest journals were in the same discipline. For papers, the question might be whether the closest papers were on the same topic.

Global accuracy refers to the tendency for groups of tokens to be correctly placed or ranked, and requires that the tokens be clustered.

The assessment of accuracy requires some sort of independent data to use as a basis of comparison.

Coverage helps to assess the impact of thresholds on accuracy. In this analysis, thresholds are used to identify all relationships that are at or above a certain level of relatedness. Very high thresholds of relatedness will tend to identify the relationships between only a few tokens; lower thresholds will include more tokens, but the level of accuracy will likely be lower.
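The interplay of thresholds, accuracy, and coverage can be illustrated with a toy sketch (the journals, categories, and relatedness scores are all invented): accuracy is the fraction of retained pairs whose two items share a category, and coverage is the fraction of items appearing in at least one retained pair.

```python
def accuracy_and_coverage(scored_pairs, categories, threshold):
    """Keep the pairs whose relatedness is at or above `threshold`;
    accuracy = fraction of kept pairs within one category,
    coverage  = fraction of all items appearing in a kept pair."""
    kept = [(a, b) for a, b, s in scored_pairs if s >= threshold]
    if not kept:
        return 0.0, 0.0
    correct = sum(1 for a, b in kept if categories[a] == categories[b])
    covered = {t for pair in kept for t in pair}
    return correct / len(kept), len(covered) / len(categories)

# Hypothetical journals with known categories, and scored journal pairs
categories = {"J1": "bio", "J2": "bio", "J3": "phys", "J4": "phys"}
pairs = [("J1", "J2", 0.9), ("J3", "J4", 0.7),
         ("J2", "J3", 0.6), ("J1", "J4", 0.2)]

print(accuracy_and_coverage(pairs, categories, 0.65))  # high threshold
print(accuracy_and_coverage(pairs, categories, 0.5))   # lower threshold
```

Lowering the threshold from 0.65 to 0.5 admits the cross-category pair (J2, J3), so accuracy falls from 1.0 to 2/3: the trade-off the text describes.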

Scalability refers to the ability of a measure (or a derived measure from a visualization program) to be applied to extremely large databases.

Robustness refers to the ability of a measure to remain accurate when subjected to visualization algorithms. Visualization algorithms reduce the dimensionality of the data, and it is reasonable to assume that the reduction in dimensionality will affect the accuracy of the measure. While the visualizations allow a user to gain insights into the underlying structure of the data, these insights should be qualified by an assessment of the concurrent loss of accuracy.

One expectation is that greater coverage will result in lower accuracy.

Another expectation is that the measures that utilize more data and more calculations will be more accurate but less scalable.

A third expectation is that accuracy will drop when a measure is subjected to dimension-reduction techniques because the underlying data is inherently multidimensional.

The last tradeoff refers to the choice of intercitation versus cocitation measures. On the one hand, intercitation-based measures should be more accurate because the data are more current (current year to past years rather than past-year pairs). On the other hand, cocitation measures can cover far more sources.

The data used to calculate relatedness measures for this study were based on intercitation and cocitation frequencies obtained from the ISI annual file for the year 2000. Science Citation Index Expanded (SCIE; Thomson ISI, 2001a) and Social Science Citation Index (SSCI; Thomson ISI, 2001b) data files were merged, resulting in 1.058 million records from 7349 separate journals. Of the 7349 journals, we limited our analysis to the 7121 journals that appeared as both citing and cited journals. There were a total of 16.24 million references between pairs of the 7121 journals.

The resulting journal–journal citation frequency matrix was extremely sparse (98.6% of the matrix has zeros). While there was a great deal more cocitation frequency information, the journal–journal cocitation frequency matrix was also sparse (93.6% of the matrix has zeros).

The 10 relatedness measures used in this study are given below, along with their equations. The six intercitation measures are raw frequency, Cosine, Jaccard, Pearson’s r, the recently introduced average relatedness factor of Pudovkin and Garfield (2002), and a new normalized frequency measure that we introduce here, K50. ... Note that the new measure, K50, is simply the cosine index minus an expected cosine value. ... The four cocitation measures are raw frequency, cosine, Pearson’s r, and the cocitation version of the K50 measure.
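As a hedged sketch of the K50 idea, the cosine index minus an expected cosine: the expectation used here, E_ij = s_i * s_j / s_total, is a simplified stand-in for the paper's exact expected-value formula, and the counts are invented.

```python
import math

def cosine(fij, si, sj):
    """Cosine-style normalised frequency for a pair with totals si and sj."""
    return fij / math.sqrt(si * sj)

def k50(fij, si, sj, s_total):
    """K50 sketch: cosine index minus an expected cosine value.
    The expected frequency under independence is approximated here as
    si * sj / s_total (a simplified stand-in, not the paper's exact form)."""
    expected = si * sj / s_total
    return cosine(fij, si, sj) - cosine(expected, si, sj)

# Hypothetical citation counts: 30 co-occurrences, totals 100 and 50,
# out of a grand total of 10,000
print(round(k50(fij=30, si=100, sj=50, s_total=10000), 3))  # 0.417
```

Subtracting the expected value rewards pairs that co-occur more often than their overall frequencies would predict, which is the stated motivation for K50.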

As mentioned above, for each of the 10 relatedness measures, a dimension reduction was done using VxOrd. The process for calculating “re-estimated measures” is as follows. First, 2-D coordinates were calculated for each of the 7121 journals using VxOrd (cf. Figure 2). Next, the distances between each pair of journals (on the 2-D plane) were calculated for the entire set and used as the re-estimated measures of relatedness.

The IC-Pearson measure is the most accurate for higher absolute levels of relatedness (up to a rank of ~85,000). As ranked relatedness increases, the curves for all but the IC-Raw measure converge. IC-Cosine, IC-K50, and IC-Jaccard measures generate nearly identical results over the entire relatedness range up to a rank of ~125,000.

The CC-Pearson measure is the best of the four up to a rank of ~350,000, and then drops below the CC-Cosine and CC-K50. The CC-K50 is slightly more accurate than the CC-Cosine, and the raw frequency measure, CC-Raw, gives the worst results by far.

Figure 4a shows that for the intercitation measures, the IC-Cosine and IC-K50 measures cover more journals than the other measures over the entire range of rank relatedness. The IC-Jaccard and IC-RFavg measures have the next highest coverage, followed by the IC-Pearson. The IC-Raw covers the fewest journals over most of the range.

The CC-Cosine and CC-K50 have the highest coverage, followed by the CC-Pearson. Once again, raw frequency gives the worst results.

The IC-Pearson measure is more accurate for up to a coverage of 0.58, while the IC-Cosine and IC-K50 are more accurate for coverage past 0.58. Note that, excepting the raw frequency measures, both of which do poorly, the intercitation measures are more accurate than the cocitation measures.

First, the IC-Cosine, IC-K50, and IC-Jaccard measures all have roughly comparable accuracy over the entire range of coverage. The IC-K50 measure is slightly more accurate than the others from 20–50% coverage, while the IC-Cosine is the most accurate from 50–90% coverage. The IC-Pearson measure remains below these three over the entire coverage range.

Second, the intercitation measures are more accurate than the cocitation measures in all cases.

Third, the Pearson measures are less accurate than the cosine measures for both the intercitation and cocitation data.

Also, note that the re-estimated K50 measures are essentially identical to the cosine measures for both the intercitation and cocitation data. Any differences at a particular coverage value are small enough to justify using the cosine value, which requires less calculation. It appears that, although the K50, by virtue of subtracting out the expected values, gives different individual similarity values and rankings, the aggregate effect on overall accuracy is minimal.

The most striking result comes from a comparison of the results of Figures 5 and 6, namely that the overall accuracy for all re-estimated measures is higher than for the raw measures over nearly the entire coverage range. This is an extremely counterintuitive finding, given the prevailing and common belief that information is lost when dimensionality is reduced.

Three of the intercitation measures (IC-Cosine, IC-K50, and IC-Jaccard) perform similarly, all with high accuracy values at both the 50% and 95% coverage levels.

All of the intercitation measures are limited to use within the citing journal set. If coverage outside the citing journal set is desired, cocitation measures can be used. Of these, the new measure introduced in this paper, CC-K50, is slightly better than the Cosine at high-coverage levels. Both the CC-Cosine and CC-K50 are clearly better than the Pearson correlation, both in terms of accuracy, and in that they do not require n² calculations, and thus scale to much larger sets than the Pearson.
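The computational contrast can be sketched with invented citation-count vectors (toy numbers, not data from the paper): the cosine works directly on the two raw profiles, while the Pearson correlation is the cosine of the mean-centred profiles, and computing it for every journal pair over the full matrix is what drives the n² cost noted above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two citation-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    """Pearson correlation: the cosine of the mean-centred vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

# Invented cocitation profiles of two journals across five others.
a = [10, 0, 3, 7, 0]
b = [8, 1, 2, 9, 0]
print(round(cosine(a, b), 3))    # 0.968
print(round(pearson(a, b), 3))   # 0.934
```

Note how mean-centring changes the value even on identical input profiles; on sparse citation data the many shared zeros inflate the Pearson in ways the cosine avoids.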

First, we expected the Pearson correlation to provide the best results. The reason for this expectation is that the Pearson correlation uses more information in its construction (nearly the entire intercitation or cocitation matrix) than do the other measures. Pearson correlations allow for the influence of other parties. On the other hand, the other measures only use a small amount of the data in the matrix, and tend to limit their focus to the relationship between the two journals in question.

The second surprise was the increase in performance from the visualization software. We expected the performance to deteriorate due to the simple rule of thumb that reducing data to two dimensions requires tradeoffs that would result in lower accuracy.

The improvement in performance may be explained by the peculiarities of the VxOrd force directed algorithm. VxOrd balances attractive forces between nodes (the similarity values) with those of a repulsive grid that tries to force all nodes apart. It also cuts edges once the similarity-to-distance ratio falls below a threshold, and in most cases cuts about 50% of the original edges, thus leaving edges only where particularly strong similarities exist among a set of nodes. These dominant similarities are likely to be very accurate on the whole, and when concentrated by pruning the less accurate edges, may increase the overall accuracy of the solution.
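The pruning idea can be illustrated with a toy sketch (node positions, edge weights, and the fixed keep-fraction below are all invented; VxOrd's actual criterion is a similarity-to-distance threshold applied during layout, not a fixed fraction):

```python
def prune_edges(edges, positions, keep_fraction=0.5):
    """Keep only the strongest edges by similarity-to-distance ratio.
    A simplified stand-in for VxOrd's edge-cutting step."""
    def ratio(item):
        (i, j), sim = item
        (xi, yi), (xj, yj) = positions[i], positions[j]
        dist = ((xi - xj) ** 2 + (yi - yj) ** 2) ** 0.5
        return sim / dist if dist > 0 else float("inf")
    ranked = sorted(edges.items(), key=ratio, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return dict(ranked[:keep])

# Invented layout: node 2 sits far away, so its edges have low
# similarity-to-distance ratios and are cut, leaving only the
# strong local similarity between nodes 0 and 1.
pos = {0: (0, 0), 1: (1, 0), 2: (10, 10)}
edges = {(0, 1): 0.9, (0, 2): 0.8, (1, 2): 0.1}
print(prune_edges(edges, pos))   # {(0, 1): 0.9}
```

The surviving edges are exactly the "particularly strong similarities" the authors credit for the accuracy gain.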

Friday, March 15, 2013

Otte, E & Rousseau, R. (2002). Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28 (6) 2002, pp. 441–453.

network analysis

In informetrics, many kinds of relations, such as collaboration, citation, and co-citation, can be considered for constructing social networks. This study introduces the concepts and development of social network analysis (SNA) together with related research and literature in information science, and then examines the SNA field itself by means of informetrics and network analysis. The study stresses that, in a social context, SNA puts the relationships between actors ahead of the attributes of the actors themselves, although both are needed to fully understand social phenomena. SNA research also tries to understand how structural regularities influence actors' behaviour.

Conceptually, SNA first distinguishes the ego network, centred on the relations of one particular actor, from the global network of relations among all participants. A network is represented as a graph: each actor corresponds to a node, and a link joins two nodes if the corresponding actors stand in the relation under analysis. If the relation between actors is asymmetric, the resulting graph is directed; otherwise it is undirected. A path between two nodes consists of a sequence of distinct links, and the number of links is the length of the path. A component is a set of mutually connected nodes together with the links between them: every pair of nodes in a component is joined by at least one path, and any two nodes joined by a path belong to the same component. In a complete graph, every pair of nodes is directly linked. The density of a graph is the number of its links divided by the number of links in a complete graph with the same number of nodes. The degree centrality of a node is the number of links attached to it divided by the number of nodes minus one; its closeness centrality is the number of nodes minus one divided by the sum of the shortest-path lengths from this node to all other nodes; and its betweenness centrality is the proportion of shortest paths between pairs of other nodes that pass through this node, standardized by the number of such pairs, (n-1)(n-2)/2 = (n²-3n+2)/2. A clique is a subgraph formed by a group of nodes that are all directly linked to one another, together with those links.

SNA took off in the early 1980s. Reasons for its rise include the professional organization founded by Barry Wellman, the International Network for Social Network Analysis (INSNA), as well as the appearance of many textbooks and analysis software packages: well-known textbooks include Knoke & Kuklinski (1982), Wellman & Berkowitz (1988), Scott (1991), and Wasserman & Faust (1994), and software includes UCInet, Gradap, Multinet, Negopy, and Pajek.

Among 1,601 SNA-related articles, 133 authors appeared three times or more. The largest component of their co-authorship network contains 57 authors, with a density of only 0.05, clearly a very sparse network. The author with the highest degree centrality is Barry Wellman, who has co-authored with nine other authors; the author with both the highest closeness and the highest betweenness centrality is Patrick Doreian, indicating that he has the shortest paths to all other authors and connects many different groups of authors.

In informetrics, researchers study citation networks, co-citation networks, collaboration structures and other forms of social interaction networks [11–19].
This individualistic approach ignores the social context of the actor [21]. One could say that properties of actors are the prime concern here.
In SNA, however, the relationships between actors become the first priority, and individual properties are only secondary. Relational data are the focus of the investigations.
It should be pointed out, however, that individual characteristics as well as relational links are necessary in order to fully understand social phenomena [21].
Wetherell et al. [22, p. 645] describe SNA as follows:
Most broadly, social network analysis (1) conceptualises social structure as a network with ties connecting members and channelling resources, (2) focuses on the characteristics of ties rather than on the characteristics of the individual members, and (3) views communities as ‘personal communities’, that is, as networks of individual relations that people foster, maintain, and use in the course of their daily lives.
Another important aspect of SNA is the study of how structural regularities influence actors’ behaviour.
In ‘ego’ studies the network of one person is analysed. ...
In global network analyses one tries to find all relations between the participants in the network.
A directed graph G, a digraph, consists of a set of nodes, denoted as N(G), and a set of links (also called arcs or edges), denoted as L(G). ... In sociological research nodes are often referred to as ‘actors’. A link e, is an ordered pair (i,j) representing a connection from node i to node j. Node i is called the initial node of link e, i = init(e), and node j is called the final node of the link: j = fin(e).
If the direction of a link is not important, or equivalently, if existence of a link between nodes i and j necessarily implies the existence of a link from j to i, we say that this network is an undirected graph.
A path from node i to node j is a sequence of distinct links (i, u1), (u1,u2), . . ., (uk,j). The length of this path is the number of links (here k+1).
An undirected graph can be represented by a symmetrical matrix M = (mij), where mij is equal to 1 if there is an edge between nodes i and j, and mij is 0 if there is no direct link between nodes i and j.
A component of a graph is a subset with the characteristic that there is a path between any node and any other one of this subset. If the whole graph forms one component it is said to be totally connected.
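The matrix representation and the component definition from the two excerpts above can be combined into a short sketch (the 4-node example graph is invented): a breadth-first search over a symmetric 0/1 adjacency matrix recovers the components.

```python
from collections import deque

def components(adj):
    """Connected components of an undirected graph given as a
    symmetric 0/1 adjacency matrix (list of lists)."""
    n = len(adj)
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            i = queue.popleft()
            comp.append(i)
            for j in range(n):
                if adj[i][j] and j not in seen:
                    seen.add(j)
                    queue.append(j)
        comps.append(sorted(comp))
    return comps

# Invented example: nodes 0-1-2 form a path and node 3 is isolated,
# so the graph has two components and is not totally connected.
M = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
print(components(M))   # [[0, 1, 2], [3]]
```

A graph is totally connected exactly when this function returns a single component.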
The density is an indicator for the general level of connectedness of the graph. If every node is directly connected to every other node, we have a complete graph. The density of a graph is defined as the number of links divided by the number of links in a complete graph with the same number of nodes.
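A one-line sketch of this definition (a complete undirected graph on n nodes has n(n-1)/2 links; the figure of roughly 80 links for the 57-author component mentioned later is inferred from its reported density of 0.05, not stated in the paper):

```python
def density(n_nodes, n_links):
    """Density of an undirected graph: links divided by the
    n(n-1)/2 links of a complete graph on the same nodes."""
    return n_links / (n_nodes * (n_nodes - 1) / 2)

print(density(3, 3))               # a triangle is complete: 1.0
print(round(density(57, 80), 2))   # ~0.05, like the co-authorship component
```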
Degree centrality of a node is defined as the number of ties this node has (in graph-theoretical terminology, the number of edges adjacent to this node). ... The degree centrality in an N-node network can be standardized by dividing by N–1: dS(i) = d(i)/(N-1).
Closeness centrality of a node is equal to the total distance (in the graph) of this node from all other nodes. ... Closeness is an inverse measure of centrality in that a larger value indicates a less central actor while a smaller value indicates a more central actor. For this reason the standardized closeness is defined as cS(i) = (N–1)/c(i), making it again a direct measure of centrality.
Finally, betweenness centrality may be defined loosely as the number of times a node needs a given node to reach another node. Stated otherwise, it is the number of shortest paths that pass through a given node. ... Betweenness gauges the extent to which a node facilitates the flow in the network. It can be shown that for an N-node network the maximum value for b(i) is (N²-3N+2)/2. Hence the standardized betweenness centrality is: bS(i) = 2b(i)/(N²-3N+2).
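The three standardized centralities in the excerpts above can be computed directly from the formulas. Below is a from-scratch sketch on an invented star graph (dedicated tools such as UCInet or Pajek would normally be used instead):

```python
from collections import deque

def shortest_paths(adj, s):
    """BFS from s over a 0/1 adjacency matrix: returns distances and
    the number of shortest paths to every node (used for betweenness)."""
    n = len(adj)
    dist, sigma = [-1] * n, [0] * n
    dist[s], sigma[s] = 0, 1
    queue = deque([s])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if adj[i][j]:
                if dist[j] < 0:
                    dist[j] = dist[i] + 1
                    queue.append(j)
                if dist[j] == dist[i] + 1:
                    sigma[j] += sigma[i]
    return dist, sigma

def degree_centrality(adj, i):
    return sum(adj[i]) / (len(adj) - 1)        # dS(i) = d(i)/(N-1)

def closeness_centrality(adj, i):
    dist, _ = shortest_paths(adj, i)
    return (len(adj) - 1) / sum(dist)          # cS(i) = (N-1)/c(i)

def betweenness_centrality(adj, k):
    """Shortest paths between other pairs that pass through k,
    standardized by the maximum (N^2-3N+2)/2."""
    n = len(adj)
    dist_k, sigma_k = shortest_paths(adj, k)
    b = 0.0
    for i in range(n):
        if i == k:
            continue
        dist_i, sigma_i = shortest_paths(adj, i)
        for j in range(i + 1, n):
            if j == k or sigma_i[j] == 0:
                continue
            # a shortest i-j path runs through k iff the distances add up
            if dist_i[k] + dist_k[j] == dist_i[j]:
                b += sigma_i[k] * sigma_k[j] / sigma_i[j]
    return 2 * b / (n * n - 3 * n + 2)         # bS(i) = 2b(i)/(N²-3N+2)

# Invented star graph: node 0 is the centre, nodes 1-3 are leaves.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
```

On the star, the centre scores 1.0 on all three measures while every leaf has betweenness 0, mirroring the intuition that only the centre acts as a "middleman".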
A clique in a graph is a subgraph in which any node is directly connected to any other node of the subgraph.
The three graphs (Figs 2–4) demonstrate the fact that it was only in the early 1980s that SNA started its career. The main reasons for this are the institutionalization of social network analysis since the late 1970s, and the availability of basic textbooks and computer software.
The institutionalization of the field began with the foundation in 1978 by Barry Wellman of the International Network for Social Network Analysis (INSNA). This is the professional association for researchers interested in social network analysis. Its principal functions are the publication of the informal bulletin Connections, containing news, scholarly articles, technical columns, abstracts and book reviews; sponsoring the annual International Social Networks Conference (also known as Sunbelt) and maintaining electronic, web-based services for its members. The society also publishes, in association with Elsevier, the peer-reviewed international quarterly Social Networks.
The earliest basic text that the authors know of dealing exclusively with social network analysis is Knoke and Kuklinski’s Network Analysis, published in 1982. Other important books having influenced the growth of the discipline are Wellman and Berkowitz’ Social Structures: a Network Approach (1988), Scott’s Social Network Analysis: a Handbook (1991), and Wasserman and Faust’s Social Network Analysis: Methods and Applications (1994).
The development of dedicated software also led to an increase in interest in the field and its methods. The best-known (and very user-friendly) program for the analysis of social networks is UCInet. ... UCInet can easily be combined with Krackplot, a well-known program for drawing social maps. Other examples of computer programs for social network analysis are Gradap, Multinet, Negopy and Pajek.
In the 1601 articles dealing with SNA there were 133 authors occurring three times or more. Forming an undirected co-authorship graph (of these 133 authors) led to a big connected component of 57 authors, two components of four authors, two components of three authors, seven small components consisting of two authors and 48 singletons.
The density for the central network of network analysts is 0.05, so this network is clearly not dense at all, but very loose.
In this network being a central author means that this scientist has collaborated (in the sense of co-authored) with many colleagues. The author with the highest degree centrality is Barry Wellman (University of Toronto), who has a degree centrality of 9.
A high closeness for an actor means that he or she is related to all others through a small number of paths. The most central author in this sense is Patrick Doreian (University of Pittsburgh).
Actors with a high betweenness play the role of connecting different groups, as ‘middlemen’. Again Patrick Doreian has the highest betweenness.
A small-world network is then characterized as a network exhibiting a high degree of clustering and having at the same time a small average distance between nodes. Moreover, the ‘hubs’ and ‘authorities’ approach is related to the Pinski–Narin influence weight citation measure [46] and mimics the idea of ‘highly cited documents’ (authorities) and reviews (hubs) [1].