van Eck, N. J. and Waltman, L. (2008). Appropriate similarity measures for author co-citation analysis. Journal of the American Society for Information Science and Technology, 59, 1653-1661.
information visualization
Since the study by Ahlgren, Jarneving, and Rousseau (2003), many studies have examined whether the Pearson correlation is an appropriate measure of the similarity between two authors in author co-citation analysis (ACA). On the theoretical side, this study uses two conditions to test whether various indirect similarity measures are appropriate: (1) the similarity between two authors is maximal if and only if their co-citation profiles differ by at most a multiplicative constant; (2) the similarity between two authors is minimal if and only if the two authors have no co-cited author in common. Because the Pearson correlation satisfies neither condition, it is not an appropriate similarity measure, whereas the cosine, the Jensen-Shannon divergence, and the Bhattacharyya distance, which satisfy both conditions, are all appropriate. On the empirical side, the study also uses White and McCain's (1998) co-citation data on prominent information-science authors to compare the four indirect similarity measures (the Pearson correlation, the cosine, the Jensen-Shannon divergence, and the Bhattacharyya distance), mapping the co-citation similarities between authors into a graphical layout with multidimensional scaling. The results show that only the Pearson correlation pushes the authors' points toward the edges of the map, while the other three similarity measures do not produce maps with an empty center. The authors therefore conclude that, from a practical point of view as well, the Pearson correlation is not an appropriate similarity measure for author co-citation analysis.
Ahlgren, Jarneving, and Rousseau (2003) questioned the appropriateness of the Pearson correlation for measuring the similarity between authors’ co-citation profiles. ... White (2003a) argued that the objections of Ahlgren et al. against the Pearson correlation are mainly of theoretical interest and have little practical relevance, and Bensman (2004) defended the use of the Pearson correlation for statistical inference. Leydesdorff and Vaughan (2006), however, went even further than Ahlgren et al. and asserted that co-citation data should be analyzed directly, without first calculating a similarity measure.
Schneider and Borlund (2007) pointed out that from a statistical perspective the common practice of calculating similarity measures based on co-citation data rather than citation data is quite unorthodox. In addition, they also mentioned some drawbacks of the use of the Pearson correlation as a similarity measure.
In our opinion, a statistically valid analysis can be performed using either citation data or co-citation data (although the two types of data may require different similarity measures).
Suppose that we have a bibliographic data set and that we are interested in analyzing the co-citations of a set of n authors in this data set. Typically, the analysis is performed as follows (see McCain, 1990, for a detailed discussion, and White & Griffith, 1981, and White & McCain, 1998, for well-known examples). First, for each pair of two authors i and j (i ≠ j), the number of co-citations in the data set, denoted by c_ij, is counted. Next, the co-citation counts are used to calculate similarities between the authors. Traditionally, this is done using the Pearson correlation as a similarity measure for co-citation profiles. ... As a final step, the similarities between the authors are analyzed using multivariate statistical techniques such as multidimensional scaling and hierarchical clustering.
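As a rough Python sketch of these counting and similarity-calculation steps (not taken from the paper; the co-citation counts are invented, and excluding positions i and j when comparing two profiles is an illustrative convention, not necessarily the paper's exact procedure):

# Sketch: pairwise Pearson correlations between co-citation profiles.
import numpy as np

# Symmetric co-citation matrix C, where C[i, j] is the number of
# co-citations of authors i and j (all counts invented).
C = np.array([
    [0, 8, 4, 0, 2, 0],
    [8, 0, 6, 1, 0, 0],
    [4, 6, 0, 5, 3, 2],
    [0, 1, 5, 0, 7, 6],
    [2, 0, 3, 7, 0, 9],
    [0, 0, 2, 6, 9, 0],
])
n = C.shape[0]

def pearson_similarity(C, i, j):
    # Compare the profiles of i and j over the other n - 2 authors,
    # leaving out positions i and j.
    mask = np.ones(n, dtype=bool)
    mask[[i, j]] = False
    return np.corrcoef(C[i, mask], C[j, mask])[0, 1]

for i in range(n):
    for j in range(i + 1, n):
        print(i, j, round(pearson_similarity(C, i, j), 3))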
Based on the above examples, we believe that an appropriate similarity measure for co-citation profiles should at least satisfy the following two conditions:
1. The similarity between two authors is maximal if and only if the authors’ co-citation profiles differ by at most a multiplicative constant.
2. The similarity between two authors is minimal if and only if there is no author with whom the two authors have both been cocited.
The above examples have shown that the Pearson correlation satisfies neither of these conditions. In our opinion, the Pearson correlation is therefore not a very satisfactory similarity measure for co-citation profiles.
The important point is that a strong linear relationship between the co-citation counts of two authors need not imply a high similarity between the authors and, the other way around, a high similarity between two authors need not imply a strong linear relationship between the co-citation counts of the authors.
Unlike the Pearson correlation, the cosine satisfies the two conditions introduced in the previous section (see Proposition 1 in the Appendix).
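A small numerical check of the two conditions, with invented profiles: a profile that is a multiple of another should receive the maximal similarity, and two profiles with no co-cited author in common should receive the minimal similarity. The cosine behaves this way; the Pearson correlation does not, because any linear relationship, not only a multiplicative one, already yields a correlation of 1, and nonoverlapping profiles need not yield a correlation of -1.

# Sketch (invented profiles): the Pearson correlation violates both
# conditions, while the cosine satisfies them.
import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

a = np.array([2.0, 4, 0, 6])
b = 3 * a                      # proportional to a: condition 1 asks for maximal similarity
c = a + 5                      # linearly related to a, but not a multiple of a
d = np.array([0.0, 0, 3, 0])   # shares no co-cited author with a: condition 2 asks for minimal similarity

print(cosine(a, b), pearson(a, b))  # both are 1.0, as they should be
print(cosine(a, c), pearson(a, c))  # Pearson is still 1.0 although c is not a multiple of a (condition 1 fails)
print(cosine(a, d), pearson(a, d))  # cosine is 0 (minimal); Pearson is negative but not at its minimum of -1 (condition 2 fails)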
Both the Pearson correlation and the cosine have the property that multiplying an author’s cocitation profile by an arbitrary constant has no effect on the author’s similarity with other authors (Anderberg, 1973). This is called the property of coordinate-wise scale invariance by Ahlgren et al. (2003). ... In other words, it guarantees that the similarity between two authors depends only on the relative frequencies with which the authors are cocited with other authors.
Because of the property of coordinate-wise scale invariance, the similarity between two authors calculated using a measure such as the Pearson correlation or the cosine does not change when the authors’ co-citation profiles are normalized to sum to one.
Hence, when we are comparing the co-citation profiles of two authors, what we are in fact doing is comparing the probability distributions of
each of the authors’ co-citations.
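This scale-invariance and normalization argument can be checked directly. In the short sketch below (invented counts), multiplying a profile by a constant, or normalizing both profiles to sum to one, leaves the cosine unchanged, so comparing the raw counts amounts to comparing the two probability distributions.

# Sketch: coordinate-wise scale invariance and normalization to
# probability distributions (counts invented).
import numpy as np

def cosine(x, y):
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([3.0, 0, 2, 5])
y = np.array([1.0, 4, 2, 3])

p = x / x.sum()   # relative co-citation frequencies of author i
q = y / y.sum()   # relative co-citation frequencies of author j

print(cosine(x, y))        # similarity based on raw counts
print(cosine(10 * x, y))   # unchanged after scaling one profile
print(cosine(p, q))        # unchanged after normalizing to probability distributions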
These are the requirements that the value of the similarity measure is maximal if and only if two distributions are identical and that it is minimal if and only if two distributions are nonoverlapping.
Perhaps the most popular similarity measure for probability distributions is the Kullback-Leibler divergence (Kullback & Leibler, 1951) from the field of information theory. However, this similarity measure has difficulties with zero probabilities and hence with zero co-citation counts.
The Jensen-Shannon divergence (Lin, 1991), which is closely related to the Kullback-Leibler divergence, does not have these difficulties and is therefore more interesting from the point of view of ACA.
Another well-known similarity measure for probability distributions is the Bhattacharyya distance (Bhattacharyya, 1943).
JS(i, j) and B(i, j) both have a value between 0 and 1. They have a value of 1 if and only if the probability distributions given by the p_ik's and the p_jk's are identical, and they have a value of 0 if and only if these distributions are nonoverlapping (see Propositions 2 and 3 in the Appendix).
It follows from this that JS(i, j) and B(i, j) both satisfy the two conditions introduced in the previous section.
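The excerpt above does not reproduce the exact formulas, so the following sketch rests on two assumptions: that JS(i, j) is taken as 1 minus the Jensen-Shannon divergence computed with base-2 logarithms, and that B(i, j) is taken as the Bhattacharyya coefficient, the sum over k of sqrt(p_ik * p_jk). Under these assumptions both measures lie between 0 and 1 and behave as described, and zero co-citation counts cause no difficulties.

# Sketch (formulas are assumptions, see the note above): Jensen-Shannon-based
# and Bhattacharyya-based similarities on normalized co-citation profiles.
import numpy as np

def js_similarity(p, q):
    m = (p + q) / 2
    def kl(a, b):
        # terms with a zero probability contribute 0, so zero counts are not a problem
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 1 - 0.5 * (kl(p, m) + kl(q, m))

def bhattacharyya_similarity(p, q):
    return np.sum(np.sqrt(p * q))

x = np.array([3.0, 0, 2, 5])
y = np.array([1.0, 4, 2, 3])
p, q = x / x.sum(), y / y.sum()
z = np.array([0.0, 1, 0, 0])   # nonoverlapping with p

print(js_similarity(p, q), bhattacharyya_similarity(p, q))
print(js_similarity(p, p), bhattacharyya_similarity(p, p))   # identical distributions -> 1
print(js_similarity(p, z), bhattacharyya_similarity(p, z))   # nonoverlapping distributions -> 0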
White (2003a) argues that theoretical shortcomings of the Pearson correlation are problematic only if there is a substantive difference between results based on the Pearson correlation and results based on theoretically sound similarity measures.
First, White (2003b, p. 427) does not seem to be completely satisfied with the maps of the information-science field provided in White and McCain (1998). In particular, he expresses some concerns about the “empty centers” that appear in these maps (also visible in the map in Figure 1). He further notes that the appearance of empty centers is not confined to information science but also happens when mapping other heterogeneous fields. White seems to prefer maps based on pathfinder networks because such maps do not have empty centers. Interestingly, our results seem to indicate that the issue of the empty centers can simply be resolved by using a theoretically sound similarity measure, such as the cosine or the Jensen-Shannon divergence, instead of the Pearson correlation.
Our results point in a different direction. The issue of the empty centers seems to be caused by the use of the Pearson correlation as a similarity measure for co-citation profiles rather than by the use of multidimensional scaling as a mapping technique for author similarities.
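For the mapping step itself, a possible sketch is shown below. The software (scikit-learn) and the conversion of similarities to dissimilarities via d = 1 - s are assumptions for illustration, not the paper's procedure; the similarity values are invented.

# Sketch: mapping a similarity matrix with multidimensional scaling.
import numpy as np
from sklearn.manifold import MDS

S = np.array([          # invented pairwise similarities in [0, 1]
    [1.00, 0.80, 0.10, 0.05],
    [0.80, 1.00, 0.20, 0.10],
    [0.10, 0.20, 1.00, 0.70],
    [0.05, 0.10, 0.70, 1.00],
])
D = 1 - S                                   # convert similarities to dissimilarities
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)               # 2-D coordinates of the four authors
print(coords)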
In our opinion, however, there are three reasons why the possibility of statistical inference does not give the Pearson correlation an advantage over other similarity measures.
First, it is well known that the distributional assumptions underlying the use of the Pearson correlation for statistical inference are not met in ACA (e.g., Ahlgren et al., 2003; White, 2003a).
For example, the t test for the significance of the Pearson correlation between two random variables assumes that at least one of the two variables is normally distributed (e.g., Snedecor & Cochran, 1989). Since cocitation counts have discrete distributions that are typically highly skewed (e.g., Ahlgren et al., 2003; White, 2003a), this assumption is violated in ACA. ... It follows from this result that the t test for the significance of the Pearson correlation between two variables may not be very accurate when the variables are both nonnormally distributed, as is the case in ACA. In our opinion, it is therefore better not to use the t test in ACA.
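For concreteness, the test being discussed uses the statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The sketch below (with invented, skewed counts) only makes explicit what is being criticized; it is not a recommendation to use the test in ACA.

# Sketch of the t test for the significance of a Pearson correlation.
import numpy as np
from scipy import stats

def pearson_t_test(x, y):
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r ** 2))   # test statistic
    p = 2 * stats.t.sf(abs(t), df=n - 2)      # two-sided p value
    return r, t, p

# Discrete, highly skewed "co-citation counts" (invented), for which the
# normality assumption underlying the test is clearly violated.
x = np.array([0, 0, 1, 0, 2, 0, 0, 15, 1, 0])
y = np.array([0, 1, 0, 0, 3, 0, 0, 22, 0, 1])
print(pearson_t_test(x, y))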
Second, even if an appropriate statistical test is used, it is not clear what it means to know that the Pearson correlation between the co-citation counts of two authors is significantly greater than zero (or significantly different from zero).
On the one hand, a positive correlation is not necessary for a high similarity between two authors. ... On the other hand, a positive correlation is also not sufficient for a high similarity between two authors. So, a positive correlation is neither necessary nor sufficient for a high similarity. Conversely, a correlation of zero is neither necessary nor sufficient for a low similarity (or for no similarity at all).
Third, all similarity measures can be used for statistical inference, not only the Pearson correlation. One way to do this is to use a statistical technique called bootstrapping. Bootstrapping is a generally applicable computer-intensive technique that can be used to calculate standard errors and confidence intervals and to test hypotheses. It replaces traditional statistical analysis by a considerable amount of computation and can be applied to problems for which a theoretical analysis either is too complicated or requires very demanding assumptions.
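A minimal bootstrap sketch under stated assumptions: the resampling unit here is the vector position (the other authors), which is only one of several possible choices, and the two profiles are invented. It produces a percentile confidence interval for the cosine similarity between two co-citation profiles.

# Minimal bootstrap sketch: percentile confidence interval for the cosine
# similarity between two co-citation profiles (data and resampling unit
# are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)

def cosine(x, y):
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    return 0.0 if nx == 0 or ny == 0 else np.dot(x, y) / (nx * ny)

x = np.array([3.0, 0, 2, 5, 1, 0, 4, 2])
y = np.array([1.0, 4, 2, 3, 0, 1, 3, 2])

boot = []
for _ in range(2000):
    idx = rng.integers(0, len(x), size=len(x))   # resample positions with replacement
    boot.append(cosine(x[idx], y[idx]))
low, high = np.percentile(boot, [2.5, 97.5])     # 95% percentile interval
print(cosine(x, y), low, high)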