Wednesday, March 27, 2013

Gmür, M. (2003). Co-citation analysis and the search for invisible colleges: A methodological evaluation. Scientometrics, 57(1), 27-57.


information visualization

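Co-citation is the phenomenon of two documents appearing together in the reference list of the same paper; the number of co-citations can represent how closely the contents of the two documents are related. A network graph maps documents to nodes and decides whether two nodes are linked according to the strength of the co-citation relationship between the corresponding documents. Clusters are identified from these links: nodes that form a completely interconnected group, or several nodes that all link to one other node and so form a star-shaped group. The documents at the nodes of each cluster then identify the topic the cluster represents, and the structure among the topics gives a picture of the discipline as a whole. Taking organization science as its example, this paper discusses how different co-citation measures and clustering methods affect the topics represented by the clusters in the network graph and the structure among those topics. The measures are 1) co-citation counts, 2) co-citation counts relative to the mean of the citation counts, 3) co-citation counts relative to the minimum of the citation counts, 4) co-citation counts relative to both the mean and the minimum of the citation counts, 5) Pearson correlation coefficients based on co-citation counts, and 6) factor analysis based on co-citation counts. In analysing the clustering results of each method, the paper describes the clusters directly and also evaluates them with numerical indicators: cluster size, cluster density, centralization, differentiation within the network, and penetration of the complete network. Among the methods, co-citation counts relative to the citation mean, co-citation counts relative to both the mean and the minimum, and the Pearson correlation coefficient mostly yield clusters that grow outward from a completely interconnected core; their values on the evaluation indicators are close, and their clusters represent similar topics. Co-citation counts relative to the citation minimum, by contrast, usually yield clusters centred on a single highly cited document, with higher centralization.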

This paper summarizes the present state of co-citation analysis and presents several methods of clustering references.
The database used is a selection of 2,114 documents in the field of organization studies from 1986-2000. .... For the calculation of co-citation networks, the references were reduced to those cited in at least 2% of the evaluated articles, i.e. at least 42 times. This resulted in a data set of 15,761 citations distributed between 194 different references.
Co-citation analysis enables the identification of groups of scientists and their publications, and for conclusions to be drawn about the inner structure of research disciplines, schools or paradigms (SMALL, 1980).
A co-citation is taken to exist if two references or authors appear in the same bibliography. It is interpreted as the measure for similarity of content of the two references or authors. The number of co-citations determines the proximity of any two publications in terms of content.
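The counting rule just quoted can be sketched in a few lines of Python. This is only an illustration on made-up data: the function name `cocitation_counts` and the toy bibliographies are assumptions, not anything taken from the paper.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(bibliographies):
    """Count, for every pair of references, how many bibliographies contain both."""
    counts = Counter()
    for refs in bibliographies:
        # Deduplicate and sort so that each pair has a single canonical key.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: three citing articles and their reference lists.
bibliographies = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C", "D"],
]
counts = cocitation_counts(bibliographies)
print(counts[("A", "B")])  # A and B share two bibliographies
print(counts[("A", "D")])  # A and D never appear together
```

Pairs that never occur together simply read as zero, because `Counter` returns 0 for missing keys.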
To date, studies have been focused on biomedicine, which comprises the largest area of publication within the SCI (e.g., SMALL & GRIFFITH, 1974; MULLINS et al., 1977; SMALL, 1977; 1980; SMALL & GREENLEE, 1980; 1989; MCCAIN, 1989; 1991), as well as on information science and sociology of knowledge (WHITE & GRIFFITH, 1981; 1982; KÄRKI, 1996; PERSSON, 1994; WHITE & MCCAIN, 1998).
In most cases, documents or authors are chosen on the basis of their frequency of citation within a delineated ISI database, a criterion that meets the basic premises of co-citation analysis.
As SMALL & SWEENEY (1984) have shown in their comparison of methods, it can be an advantage to define citations not in absolute terms but in their relation to the length of the citing document’s bibliography.
Several alternatives for the weighting of co-citation counts have developed:
1. The highest absolute co-citation counts are assessed for cluster formation, up to a specified limit.
2. Co-citations are measured in relation to the co-citation partners’ citation counts.
3. Co-citations are converted into Pearson’s correlation coefficients, which also achieves a standardization effect.
The macro approach focuses on the overall structure of disciplines, and, ultimately, on the question of which laws govern the evolution of science.
The micro approach aims to describe retrospectively the structure and historical development of individual disciplines or schools of research and their interdependencies. For pragmatic reasons, an author-centred approach dominates here, and scientists tend to rely upon procedures of correlation and factor analysis for their studies. With this approach, a discipline’s structures and lines of development are drawn along its most prominent representatives.
Here, ‘cluster’ is the name given to a group of references with multiple connections to each other, defined by the connection rules for the co-citation network. ... In the following analysis, clusters are interpreted as a self-contained community only if they feature at least one completely interconnected group of three references, or a group of five references with star-shaped connections.
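A possible way to check this community criterion on a candidate cluster is sketched below. It assumes unweighted links, and it reads "a group of five references with star-shaped connections" as one reference linked to at least four others; that reading, and the name `is_community`, are assumptions made for this sketch.

```python
from itertools import combinations


def is_community(nodes, edges):
    """Return True if the cluster contains a triangle or a five-node star."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    # Completely interconnected group of three: two neighbours of a node
    # that are also linked to each other.
    has_triangle = any(
        b in neighbours[a]
        for n in nodes
        for a, b in combinations(sorted(neighbours[n]), 2)
    )
    # Star of five (assumed reading): one centre linked to at least four others.
    has_star = any(len(neighbours[n]) >= 4 for n in nodes)
    return has_triangle or has_star


triangle = is_community(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")])
chain = is_community(["A", "B", "C"], [("A", "B"), ("B", "C")])
star = is_community(list("ABCDE"), [("A", x) for x in "BCDE"])
print(triangle, chain, star)
```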
- Cluster size is defined as the number of references that comprise a cluster.
- The in-degree is the sum of internal relationships, i.e., of all co-citations between the references in the cluster yielded by the method of calculation.
- Cluster density is expressed as a percentage and calculated as the quotient of the cluster’s actual in-degree and its maximum possible in-degree.
- Centralization is a measure for the dominance of a single reference within a cluster. Centralization is calculated as the quotient of the co-citation sum of the most-connected reference within the cluster and the mean co-citation sum of all references.
- The out-degree is the sum of all external relationships, i.e., of all co-citations between internal cluster references and those external to it.
- Differentiation within the network serves as an indicator as to how well clusters may be distinguished from one another. Differentiation is calculated as a quotient from the sum of all clusters’ in-degrees and out-degrees.
- Network size is defined as the sum of all references yielding at least one co-citation under the particular method of calculation.
- Penetration of the complete network is expressed as a percentage and calculated as the quotient of the sum of all references in the data set and the size of the network resulting from the calculation method.
- Network differentiation indicates the effectiveness of the method of calculation in representing differences between clusters.
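The cluster-level indicators defined in this list can be made concrete with a small sketch. It assumes that each link selected by the calculation method counts once (unweighted edges) and that the "co-citation sum" of a reference is its number of links inside the cluster; both simplifications, the function name `cluster_metrics`, and the toy edge list are assumptions rather than the paper's own procedure.

```python
def cluster_metrics(cluster, edges):
    """Compute size, in-degree, out-degree, density and centralization of one cluster."""
    cluster = set(cluster)
    if not cluster:
        raise ValueError("cluster must contain at least one reference")
    # Internal links join two members; external links join a member and a non-member.
    internal = [(a, b) for a, b in edges if a in cluster and b in cluster]
    external = [(a, b) for a, b in edges if (a in cluster) != (b in cluster)]
    size = len(cluster)
    in_degree = len(internal)
    max_in_degree = size * (size - 1) // 2
    # Number of internal links per reference (assumed stand-in for the co-citation sum).
    degree = {n: 0 for n in cluster}
    for a, b in internal:
        degree[a] += 1
        degree[b] += 1
    mean_degree = sum(degree.values()) / size
    return {
        "size": size,
        "in_degree": in_degree,
        "out_degree": len(external),
        "density_pct": 100 * in_degree / max_in_degree if max_in_degree else 0.0,
        "centralization": max(degree.values()) / mean_degree if mean_degree else 0.0,
    }


# Hypothetical network: A-D form a cluster, E is an outside reference.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
m = cluster_metrics({"A", "B", "C", "D"}, edges)
print(m)
```

Differentiation within the network would then follow by summing `in_degree` over all clusters and dividing by the summed `out_degree`.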
Method 1: Maximum co-citation counts
Cluster formation on the basis of the maximum absolute co-citation counts. The highest 200 counts were selected from the co-citation matrix.
This points to the fact that this method of network formation, in a discipline with a large citation count variance and a low level of polarization in terms of content, is most suitable for the representation of a school’s mainstream research. Often-cited references without frequent cross-references and their subsequent low co-citation counts are thereby discarded, which represents an advantage over straightforward ordering by citation proportions. However, differentiation between clusters is quite low, due to the close connections of citation- and co-citation counts for the most-cited references.
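As a rough illustration of Method 1, the selection step amounts to keeping the pairs with the largest absolute counts. The counts below are invented and the limit is 2 rather than the 200 used in the paper; `strongest_links` is a hypothetical helper name.

```python
from collections import Counter


def strongest_links(cocitations, limit):
    """Keep the `limit` pairs with the highest absolute co-citation counts."""
    # most_common sorts by count, descending; ties at the cut-off are
    # resolved by insertion order, which is an arbitrary choice here.
    return [pair for pair, _ in Counter(cocitations).most_common(limit)]


# Hypothetical co-citation counts for four pairs of references.
cocitations = {("A", "B"): 30, ("A", "C"): 12, ("B", "C"): 25, ("C", "D"): 4}
links = strongest_links(cocitations, 2)
print(links)
```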
Method 2: Co-citation counts relative to citation mean
To a large extent, this method corresponds to the Jaccard coefficient used by SMALL & GREENLEE (1980), which places the co-citation count in relation to the sum of both partners’ individual citations, less the co-citation count.
This method highlights symmetrical co-citation relations, i.e., co-citations between documents that are cited with similar frequency, as opposed to relations between documents cited with differing frequency. This is manifest in the fact that core documents in the large clusters show far fewer relations than in method 1, where they occupy a central position. In this respect, the method is sensitive to the citation count variance of the data set: the greater the variance, the greater the risk that important co-citations between references cited with differing frequency are overlooked. Aside from the clusters, a series of bridging documents emerge as linking components between individual clusters. This method leads to these documents remaining outside of the study, because taking them into account would diminish the density when the out-degree is taken into account.
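The preference for symmetrical pairs can be seen in a small numerical sketch. The two functions follow the verbal definitions in the notes (co-citation count over the mean of the two citation counts, and the Jaccard form with the co-citation count subtracted from the sum); the function names and the example numbers are assumptions.

```python
def relative_to_mean(cocit, cit_a, cit_b):
    """Co-citation count divided by the mean of the two citation counts."""
    return cocit / ((cit_a + cit_b) / 2)


def jaccard(cocit, cit_a, cit_b):
    """Co-citation count over the sum of both citation counts, less the co-citation count."""
    return cocit / (cit_a + cit_b - cocit)


# Hypothetical pairs with the same co-citation count of 40.
symmetric = relative_to_mean(40, 50, 50)    # both references cited 50 times
asymmetric = relative_to_mean(40, 40, 200)  # one cited 40 times, the other 200
print(symmetric, asymmetric)
print(jaccard(40, 50, 50), jaccard(40, 40, 200))
```

With equal co-citation counts, the pair with similar citation frequencies scores higher under both measures, which is the behaviour described above.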
Method 3: Co-citation counts relative to citation minimum
Using this method, the co-citation count is set in relation to the citation frequency of the smaller of the two co-citation partners. ... This method lends unspecific citations of the most-cited references lesser weight, whereas close co-citation relations between lesser-cited references are highlighted.
However, here the asymmetrical relations between references that are cited most frequently and those cited relatively infrequently are highlighted. Correspondingly, more probable star-shaped clusters are generated around a small number of central references.
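A minimal sketch of the minimum-based measure, using invented numbers, shows why asymmetrical pairs come to the fore. The function name `relative_to_minimum` is an assumption.

```python
def relative_to_minimum(cocit, cit_a, cit_b):
    """Co-citation count divided by the citation count of the less-cited partner."""
    return cocit / min(cit_a, cit_b)


# Hypothetical pairs with the same co-citation count of 40.
asymmetric = relative_to_minimum(40, 40, 200)  # the rarer reference is always co-cited
symmetric = relative_to_minimum(40, 50, 50)
print(asymmetric, symmetric)
```

A reference cited 40 times that always appears alongside a reference cited 200 times reaches the maximum value, so links pile up around the heavily cited document and a star shape becomes more likely.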
Method 4: CoCit-Score
It demonstrates a considerably higher degree of robustness than the previous three methods because it reduces the influence of the citation relation between two co-cited references. The two calculation methods, based on the citation minimum and the citation mean respectively, are linked by a simple multiplication. The formula for calculating the CoCit-Score, which takes a value between 0 and 1, sets the co-citation count in relation to the minimum and mean counts of the two individual citations:
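The formula itself is not reproduced in these notes. Reconstructed from the description above (the minimum-based and mean-based ratios multiplied together), it would read as follows, where $C_{AB}$ for the co-citation count and $c_A$, $c_B$ for the two individual citation counts are notation assumed here:

```latex
\mathrm{CoCit}_{AB}
  = \frac{C_{AB}}{\min(c_A, c_B)} \cdot \frac{C_{AB}}{\operatorname{mean}(c_A, c_B)}
  = \frac{C_{AB}^{2}}{\min(c_A, c_B)\,\cdot\,\tfrac{1}{2}(c_A + c_B)}
```

Because $C_{AB} \le \min(c_A, c_B) \le \operatorname{mean}(c_A, c_B)$, each factor is at most 1, which agrees with the stated range of 0 to 1.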
Method 5: Pearson’s correlation coefficient
Therefore, the maximum correlation coefficient of 1 is achieved if two references with identical individual citation counts are always cited in the same article. Where there are two references with differing frequency of citation, the count decreases even when the less commonly cited reference is always accompanied by the more frequently cited reference. The coefficient reaches the theoretical minimum value of -1 if exactly one of two references is cited in each of the evaluated articles. ... It is seen that the co-citation network content matches the network produced on the basis of the citation mean.
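The three cases described here can be checked numerically. The sketch assumes that each reference is represented by a binary vector over the evaluated articles (1 if the article cites it, 0 otherwise), which matches the wording of the passage; the notes do not say whether the paper computes the coefficient exactly this way. The helper `pearson` and the four-article vectors are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if a reference is cited by all or by no articles.
    return cov / sqrt(var_x * var_y)


# Hypothetical citation patterns over four articles.
always_together = pearson([1, 1, 0, 0], [1, 1, 0, 0])  # identical counts, same articles
accompanied = pearson([1, 1, 1, 0], [1, 0, 0, 0])      # rarer one always with the frequent one
exclusive = pearson([1, 0, 1, 0], [0, 1, 0, 1])        # exactly one of the two in each article
print(always_together, accompanied, exclusive)
```

The second case stays well below 1 even though the less-cited reference never appears alone, which illustrates the decrease for references with differing citation frequency.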
Method 6: Factor loadings
Factor analysis presents itself as an alternative to the counting and weighting of cumulated co-citations, with which clusters may be formed from factors and factor loadings.
This supports the conclusion that, given the high level of structural complexity of the data set, factor analysis does not generate a true representation of the co-citation relations it is based on.
In this instance, factor analysis is found to be unsuitable for the differentiation of subfields of research, as it yields low differentiation and clusters that are difficult to interpret.
Cluster formation using the strongest co-citations is the method chosen mainly in early studies up to the mid-1980s. However, it yields an insufficient level of delimitation in a field such as organization science, which is characterized by a low paradigmatical level of differentiation and a blurring of boundaries between individual research subfields.
Cluster formation on the basis of mean citation count or similar measurements, which present the co-citation in relation to the citation frequencies of the respective co-citation partner, may lead to sufficient differentiation. However, this method is susceptible to asymmetries between individual citations, so that the position of central, influential publications in an area of research is not explicit. The resultant mean centralization count for the clusters is low.
Cluster formation on the basis of the minimal individual citation count leads to centralized clusters around the most-frequently cited references, whereas the cross-links between a school’s secondary references receive only low weighting. In individual cases, the resulting density makes delineation between subfields of research more difficult, which again results in a lower degree of differentiation.
The CoCit-Score, which links two criteria for measuring the relative significance of a co-citation, gives similar weighting to both symmetrical and asymmetrical co-citation pairings. Compared to related methods, cluster size is large, so that it can be assumed that the largest clusters were assigned all relatively close references, without jeopardizing differentiation between the clusters.
An equally valid alternative is to interpret the correlation coefficient of a pairing as a coefficient of similarity. Here, selectivity is maximized as much as differentiation in the overall network.
On the macro-level of the complete network, the influence of not using co-citations as absolute values for cluster formation becomes clear. Here, the opportunities for differentiation are quite low. All other methods lead to a similar number of clusters.
On the meso-level of the clusters, the various methods already reveal clear differences. ... Schools of research close in terms of content are then placed closer still, or further away from each other, depending on the method used. However, with the exception of clustering from factor analysis, with its multiple assignations, no fundamentally new groupings emerge that do not equally demonstrate commonality with the other cluster methods.
The differences are greatest on the micro-level of individual relations. ... The selection methods on the basis of mean citation count, correlation coefficient and the CoCit-Score demonstrate a higher degree of similarity between themselves, whilst the latter shows on average the highest degree of similarities with the other methods.
- Which publications exert most influence on the discipline?
Co-citation analysis on the basis of maximum co-citation counts provides an alternative to a simple citation count. The result consists of an extendable cluster of frequently cited references closely related to other documents.
- Which communities and areas of research does the discipline encompass?
If the aim is to identify as many dense and selective clusters as possible, then the correlation coefficient is the best criterion for defining the co-citational similarity of two documents. A similarly positive result can be expected if the co-citations are set in relation to the individual citations, with greater weight conferred on the smaller of the two references (CoCit-Score).
- Which documents define the discipline’s communities or areas of research?
The method of minimal individual citation generates star-shaped and thereby sufficiently differentiated clusters around the respective dominant documents. This method is particularly well suited when combined with cluster formation using correlation coefficients or CoCit-Scores.
