van Eck, N. J., & Waltman, L. (2009). How to normalize cooccurrence data? An analysis of some well-known similarity measures. Journal of the American Society for Information Science and Technology, 60, 1635-1651.
Information Visualization
Measures of the similarity between objects based on cooccurrence data can be divided into two approaches: the indirect approach and the direct approach.
The indirect approach derives similarities by comparing the objects' cooccurrence profiles. Here the cooccurrence profile of an object is a vector containing the numbers of cooccurrences of that object with the other objects. The most commonly used indirect similarity measure for cooccurrence data is the Pearson correlation, especially in author cocitation analysis in scientometrics, for example McCain (1990), White & Griffith (1981), and White & McCain (1998).
Regarding the Pearson correlation, Ahlgren, Jarneving, & Rousseau (2003) and Van Eck & Waltman (2008) argue that this measure does not have good theoretical properties and should therefore not be used. Other indirect similarity measures, such as the Bhattacharyya distance, the cosine, and the Jensen-Shannon distance, do have good theoretical properties. Although Schneider and Borlund (2007) regard the use of indirect similarity measures as statistically unconventional, the authors hold that indirect similarity measures, being calculated from a larger amount of data, have the advantage of involving less statistical uncertainty. Another problem with indirect similarity measures is that two objects that never cooccur with each other may nevertheless have very similar cooccurrence profiles and thus receive an unreasonably high similarity. For this reason, the authors consider direct similarity measures, which take the number of cooccurrences of two objects and adjust it for the objects' numbers of occurrences, to be more reasonable.
Direct similarity measures can be further divided into set-theoretic similarity measures, which are based on the overlap of the sets of documents in which the two objects occur, and probabilistic similarity measures, which evaluate the ratio of the observed number of cooccurrences to the expected number of cooccurrences. The former include the cosine, the Jaccard index, and the inclusion index; the latter are represented by the association strength proposed by van Eck & Waltman (2007) and van Eck, Waltman, van den Berg, & Kaymak (2006).
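As a concrete illustration, the following minimal sketch computes the four direct measures from the basic counts used throughout the paper: the number of cooccurrences of two objects, the number of occurrences of each object, and the number of documents. The cosine and Jaccard formulas follow the verbal descriptions quoted later in this post; the forms assumed for the inclusion index and the association strength, $c_{ij}/\min(s_i, s_j)$ and (up to the factor $m$) $c_{ij}/(s_i s_j)$, are reconstructions, since Equations 6 and 8 of the paper are not reproduced here.

```python
# Minimal sketch of the four direct similarity measures discussed in the paper.
# The cosine and Jaccard formulas follow the verbal definitions quoted below;
# the inclusion index and association strength are assumed reconstructions of
# Equations 8 and 6, which are not reproduced in this excerpt.
from math import sqrt

def cosine(c_ij, s_i, s_j):
    """Cooccurrences divided by the geometric mean of the two occurrence counts."""
    return c_ij / sqrt(s_i * s_j)

def jaccard(c_ij, s_i, s_j):
    """Cooccurrences divided by the number of documents containing at least one of the two objects."""
    return c_ij / (s_i + s_j - c_ij)

def inclusion(c_ij, s_i, s_j):
    """Cooccurrences divided by the smaller occurrence count (assumed form of Equation 8)."""
    return c_ij / min(s_i, s_j)

def association_strength(c_ij, s_i, s_j, m):
    """Observed cooccurrences divided by the cooccurrences expected under independence,
    i.e. m * c_ij / (s_i * s_j); proportional to the association strength of Equation 6."""
    return m * c_ij / (s_i * s_j)

# Hypothetical counts: two terms occurring in 40 and 90 of 1000 documents, cooccurring in 12.
print(cosine(12, 40, 90))                      # 0.200
print(jaccard(12, 40, 90))                     # ~0.102
print(inclusion(12, 40, 90))                   # 0.300
print(association_strength(12, 40, 90, 1000))  # ~3.33 (> 1: more cooccurrences than expected by chance)
```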
From a theoretical point of view, every set-theoretic similarity measure is a special case of the generalized similarity index, or is at least monotonically related to such a special case. The generalized similarity index is the ratio of the number of cooccurrences of two objects to a power mean of the two objects' numbers of occurrences, and monotonic relatedness means that the rank order of the values is preserved under any monotonic transformation.
Probabilistic similarity measures evaluate the ratio of the observed number of cooccurrences of two objects, $c_{ij}$, to the expected number of cooccurrences under the assumption that the occurrences of the two objects are independent, $m p_i p_j$; that is, $c_{ij}/(m p_i p_j)$, where $p_i$ and $p_j$ are the probabilities that objects $i$ and $j$ occur and $m$ is the number of documents in the database. Since $p_i = s_i/m$ and $p_j = s_j/m$, with $s_i$ and $s_j$ the numbers of occurrences of objects $i$ and $j$, it follows that $c_{ij}/(m p_i p_j) = m c_{ij}/(s_i s_j)$. If $m c_{ij}/(s_i s_j) > 1$, the two objects cooccur more often than would be expected by chance, and their similarity is therefore considered larger.
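As a quick numeric check with hypothetical counts ($m = 1000$ documents, $s_i = 40$, $s_j = 90$, $c_{ij} = 12$):

\[ \frac{c_{ij}}{m p_i p_j} = \frac{m\,c_{ij}}{s_i s_j} = \frac{1000 \times 12}{40 \times 90} \approx 3.3 > 1, \]

so the two objects cooccur roughly three times as often as would be expected if their occurrences were independent.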
In scientometric research, measuring the similarity between objects by their cooccurrence counts usually requires normalization, so that the result is less affected by differences in how often the objects occur. From a theoretical point of view, the number of cooccurrences of two objects is determined not only by how similar the objects are but also by how often they occur, that is, by a similarity effect and a size effect. Cooccurrence analysis exploits the similarity effect: the more similar two objects are, the more cooccurrences they tend to have. However, the more often objects occur, the more likely they are to have many cooccurrences as well, so the purpose of normalization is to remove the influence of the size effect.
Set-theoretic similarity measures attempt to remove the size effect by dividing the number of cooccurrences by a power mean of the two objects' numbers of occurrences, but they do not normalize as well as probabilistic similarity measures do.
The study also applies correlation tests to the cosine, the Jaccard index, the inclusion index, and the association strength on empirical data consisting of author cocitation counts, journal cocitation counts, and term cooccurrence counts, and finds that only the cosine and the Jaccard index are strongly correlated with each other; the correlations among the other measures are weak.
The study further measures how strongly each measure is affected by the size effect and finds that only the association strength, a probabilistic similarity measure, is unaffected by it, so this measure is the most suitable for scientometric research.
Basically, there are two approaches that can be taken to derive similarities from cooccurrence data. We refer to these approaches as the direct and the indirect approach, but the approaches are also known as the local and the global approach (Ahlgren, Jarneving, & Rousseau, 2003; Jarneving, 2008). ...The indirect approach to derive similarities from cooccurrence data relies on cooccurrence profiles. The cooccurrence profile of an object is a vector that contains the number of cooccurrences of the object with each other object. Indirect similarity measures determine the similarity between two objects by comparing the cooccurrence profiles of the objects.
Direct similarity measures determine the similarity between two objects by taking the number of cooccurrences of the objects and adjusting this number for the total number of occurrences or cooccurrences of each of the objects. ... Usually, when a direct similarity measure is applied to cooccurrence data, the purpose is to normalize the data, that is, to correct the data for differences in the total number of occurrences or cooccurrences of objects.
An interesting finding is that despite their popularity, the cosine and the Jaccard index turn out not to be appropriate measures for normalization purposes. We argue that an appropriate measure for normalizing cooccurrence data is the association strength (Van Eck & Waltman, 2007; Van Eck, Waltman, Van den Berg, & Kaymak, 2006), also referred to as the proximity index (e.g., Peters & Van Raan, 1993a; Rip & Courtial, 1984) or the probabilistic affinity index (e.g., Zitt, Bassecoulard, & Okubo, 2000).
Indirect similarity measures, also known as global similarity measures (Ahlgren et al., 2003; Jarneving, 2008), determine the similarity between two objects i and j by comparing the ith and the jth row (or column) of the cooccurrence matrix C. The more similar the cooccurrence profiles in these two rows (or columns) of C, the higher the similarity between i and j. Indirect similarity measures are especially popular for author cocitation analysis (e.g., McCain, 1990; White & Griffith, 1981; White & McCain, 1998) and journal cocitation analysis (e.g., McCain, 1991).
Direct similarity measures determine the similarity between two objects i and j by taking the number of cooccurrences of i and j and adjusting this number for the total number of occurrences or cooccurrences of i and the total number of occurrences or cooccurrences of j. We note that in some studies similarities between objects are determined by comparing columns of the occurrence matrix O (e.g., Leydesdorff & Vaughan, 2006; Schneider, Larsen, & Ingwersen, 2009). In most cases, this approach is mathematically equivalent to the use of a direct similarity measure.
Monotonic relatedness of direct similarity measures is important because certain multivariate analysis techniques that are frequently used in scientometric research are insensitive to monotonic transformations of similarities.
The association strength defined in Equation 6 is used by Van Eck and Waltman (2007) and Van Eck et al. (2006). Under various names, the measure is also used in a number of other studies. Hinze (1994), Leclerc and Gagné (1994), Peters and Van Raan (1993a), and Rip and Courtial (1984) refer to the measure as the proximity index, while Leydesdorff (2008) and Zitt et al. (2000) refer to it as the probabilistic affinity (or activity) index. Luukkonen et al. (1992, 1993) also employ the measure, but in their work, it does not have a name.
The association strength is proportional to the ratio between, on the one hand, the observed number of cooccurrences of objects i and j and, on the other hand, the expected number of cooccurrences of objects i and j under the assumption that occurrences of i and j are statistically independent.
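Equation 6 itself is not reproduced in this excerpt, but the identity $c_{ij}/e_{ij} = m\,S_A(c_{ij}, s_i, s_j)$ quoted further below, together with $e_{ij} = s_i s_j/m$, implies

\[ S_A(c_{ij}, s_i, s_j) = \frac{c_{ij}}{s_i s_j}, \]

i.e., the association strength is proportional, by the constant factor $m$, to the ratio of observed to expected cooccurrences.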
The cosine defined in Equation 7 equals the ratio between, on the one hand, the number of times that objects i and j are observed together and, on the other hand, the geometric mean of the number of times that object i is observed and the number of times that object j is observed.
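In symbols, this verbal description corresponds to

\[ \frac{c_{ij}}{\sqrt{s_i s_j}}, \]

since $\sqrt{s_i s_j}$ is the geometric mean of $s_i$ and $s_j$.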
The cosine seems to be the most popular direct similarity measure in the field of scientometrics. Frequently cited studies in which the measure is used include Braam, Moed, and Van Raan (1991a, 1991b), Klavans and Boyack (2006a), Leydesdorff (1989), Peters and Van Raan (1993b), Peters, Braam, and Van Raan (1995), Small (1994), Small and Sweeney (1985), and Small, Sweeney, and Greenlee (1985).
Examples of the use of the inclusion index defined in Equation 8 can be found in the work of Kostoff, Del Río, Humenik, García, and Ramírez (2001), McCain (1995), Peters and Van Raan (1993a), Rip and Courtial (1984), Tijssen (1992, 1993), and Tijssen and Van Raan (1989).
The Jaccard index defined in Equation 9 equals the ratio between, on the one hand, the number of times that objects i and j are observed together and, on the other hand, the number of times that at least one of the two objects is observed.
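Written in symbols (assuming each object is counted at most once per document, so that the number of documents in which at least one of the two objects is observed equals $s_i + s_j - c_{ij}$ by inclusion-exclusion), the Jaccard index is

\[ \frac{c_{ij}}{s_i + s_j - c_{ij}}. \]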
Small uses the Jaccard index in his early work on cocitation analysis (e.g., Small, 1973, 1981; Small & Greenlee, 1980). Other work in which the Jaccard index is used includes Heimeriks, Hörlesberger, and Van den Besselaar (2003), Kopcsa and Schiebel (1998), Peters and Van Raan (1993a), Peters et al. (1995), Rip and Courtial (1984), Van Raan and Tijssen (1993), Vaughan (2006), and Vaughan and You (2006).
Boyack et al. (2005), Gmür (2003), Klavans and Boyack (2006a), Leydesdorff (2008), Luukkonen et al. (1993), and Peters and Van Raan (1993a) report results of empirical comparisons of different measures. Theoretical analyses of relations between different measures can be found in the work of Egghe (2009) and Hamers et al. (1989). Egghe and Rousseau (2006) also theoretically studied properties of various measures. Schneider and Borlund (2007a, 2007b) provide an extensive discussion of the issue of comparing different measures.
It turns out that there is a fundamental difference between the cosine, the inclusion index, and the Jaccard index, on the one hand, and the association strength, on the other hand. The first three measures all belong to the class of set-theoretic similarity measures, while the last measure belongs to the class of probabilistic similarity measures.
There are a number of properties that we believe it is natural to expect any set-theoretic similarity measure $S(c_{ij}, s_i, s_j)$ to have. Three of these properties are given below.
Property 1. If $c_{ij} = 0$, then $S(c_{ij}, s_i, s_j)$ takes its minimum value.
Property 2. For all $\alpha > 0$, $S(\alpha c_{ij}, \alpha s_i, \alpha s_j) = S(c_{ij}, s_i, s_j)$.
Property 3. If $s'_i > s_i$ and $c_{ij} > 0$, then $S(c_{ij}, s'_i, s_j) < S(c_{ij}, s_i, s_j)$.
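A quick numerical check of these properties, using the cosine as given by the verbal description quoted above ($c_{ij}/\sqrt{s_i s_j}$) and the inclusion index in its assumed form $c_{ij}/\min(s_i, s_j)$ (Equation 8 is not reproduced in this excerpt):

```python
# Checking Properties 1-3 numerically for the cosine, and showing that the
# inclusion index (assumed form c_ij / min(s_i, s_j)) only satisfies the
# weakened Property 4 discussed next.
from math import sqrt, isclose

cosine = lambda c, si, sj: c / sqrt(si * sj)
inclusion = lambda c, si, sj: c / min(si, sj)

c, si, sj = 12, 90, 40

# Property 1: if c_ij = 0, the measure takes its minimum value (0 here).
assert cosine(0, si, sj) == 0

# Property 2: invariance under scaling c_ij, s_i, and s_j by the same factor.
assert isclose(cosine(3 * c, 3 * si, 3 * sj), cosine(c, si, sj))

# Property 3: strictly lower similarity when s_i grows -- holds for the cosine ...
assert cosine(c, si + 10, sj) < cosine(c, si, sj)

# ... but not for the inclusion index: increasing the larger count s_i leaves
# min(s_i, s_j) unchanged, so the value does not decrease (Property 4 only).
assert inclusion(c, si + 10, sj) == inclusion(c, si, sj)
```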
The inclusion index defined in Equation 8 is also not a set-theoretic similarity measure. This is because the inclusion index does not have Property 3. However, the inclusion index does have the following property, which is a weakened version of Property 3.
Property 4. If $s'_i > s_i$ and $c_{ij} > 0$, then $S(c_{ij}, s'_i, s_j) \leq S(c_{ij}, s_i, s_j)$.
Property 5. If $S(c_{ij}, s_i, s_j)$ takes its minimum value, then $c_{ij} = 0$.
Property 6. If $c_{ij} = s_i = s_j$, then $S(c_{ij}, s_i, s_j)$ takes its maximum value.
Property 7. If $S(c_{ij}, s_i, s_j)$ takes its maximum value, then $c_{ij} = s_i = s_j$.
Property 8. For all $\alpha > 0$, if $c_{ij} < s_i$ or $c_{ij} < s_j$, then $S(c_{ij} + \alpha, s_i + \alpha, s_j + \alpha) > S(c_{ij}, s_i, s_j)$.
Proposition 1. All set-theoretic similarity measures $S(c_{ij}, s_i, s_j)$ have Properties 5, 6, 7, and 8.
We note that weak set-theoretic similarity measures need not have Properties 5, 7, and 8. They do have Property 6.
Property 9. If $s'_i s'_j > s_i s_j$ and $c_{ij} > 0$, then $S(c_{ij}, s'_i, s'_j) < S(c_{ij}, s_i, s_j)$. If $s'_i s'_j = s_i s_j$, then $S(c_{ij}, s'_i, s'_j) = S(c_{ij}, s_i, s_j)$.
Property 10. If $s'_i + s'_j > s_i + s_j$ and $c_{ij} > 0$, then $S(c_{ij}, s'_i, s'_j) < S(c_{ij}, s_i, s_j)$. If $s'_i + s'_j = s_i + s_j$, then $S(c_{ij}, s'_i, s'_j) = S(c_{ij}, s_i, s_j)$.
It is easy to see that these properties both imply Property 3. Hence, Properties 9 and 10 are both stronger than Property 3. It can further be seen that the cosine has Property 9 and that the Jaccard index has Property 10. The following two propositions indicate the importance of Properties 9 and 10.
Proposition 2. All set-theoretic similarity measures $S(c_{ij}, s_i, s_j)$ that have Property 9 are monotonically related to the cosine defined in Equation 7.
Proposition 3. All set-theoretic similarity measures $S(c_{ij}, s_i, s_j)$ that have Property 10 are monotonically related to the Jaccard index defined in Equation 9.
It follows from Proposition 2 that Properties 1, 2, and 9 characterize the class of all set-theoretic similarity measures that are monotonically related to the cosine. Likewise, it follows from Proposition 3 that Properties 1, 2, and 10 characterize the class of all set-theoretic similarity measures that are monotonically related to the Jaccard index.
Property 11. If $\min(s'_i, s'_j) > \min(s_i, s_j)$ and $c_{ij} > 0$, then $S(c_{ij}, s'_i, s'_j) < S(c_{ij}, s_i, s_j)$. If $\min(s'_i, s'_j) = \min(s_i, s_j)$, then $S(c_{ij}, s'_i, s'_j) = S(c_{ij}, s_i, s_j)$.
This property implies Property 4. Together with Properties 1 and 2, Property 11 characterizes the class of all weak set-theoretic similarity measures that are monotonically related to the inclusion index.
Proposition 4. All weak set-theoretic similarity measures $S(c_{ij}, s_i, s_j)$ that have Property 11 are monotonically related to the inclusion index defined in Equation 8.
The generalized similarity index equals the ratio between, on the one hand, the number of times that objects i and j are observed together and, on the other hand, a power mean of the number of times that object i is observed and the number of times that object j is observed.
An interesting property of the generalized similarity index is that, for various values of p, the index reduces to a well-known (weak or non-weak) set-theoretic similarity measure.
Proposition 5. For all finite values of the parameter p, the generalized similarity index defined in Equation 11 is a set-theoretic similarity measure.
This proposition states that the generalized similarity index describes an entire class of set-theoretic similarity measures. Each member of this class corresponds with a particular value of $p$. Only in the limit case in which $p \to \pm\infty$ is the generalized similarity index not a set-theoretic similarity measure; in this limit case, it is a weak set-theoretic similarity measure.
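Equation 11 is not reproduced in this excerpt, but from the verbal description above (cooccurrences divided by a power mean of the two occurrence counts) it presumably has the form

\[ S(c_{ij}, s_i, s_j; p) = \frac{c_{ij}}{\left( \tfrac{1}{2}\left(s_i^{\,p} + s_j^{\,p}\right) \right)^{1/p}}, \]

whose denominator reduces to the geometric mean $\sqrt{s_i s_j}$ as $p \to 0$ (giving the cosine), to $\min(s_i, s_j)$ as $p \to -\infty$ (giving the inclusion index, consistent with the statement that the limit case is only a weak set-theoretic measure), and to the arithmetic mean at $p = 1$ (giving a measure monotonically related to the Jaccard index).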
We are interested in direct similarity measures $S(c_{ij}, s_i, s_j)$ that have the following two properties.
Property 12. If $s_1 = s_2 = \ldots = s_n$, then $S(c_{ij}, s_i, s_j) = \alpha c_{ij}$ for all $i \neq j$ and for some $\alpha > 0$.
Property 13. For all $\alpha > 0$, $S(\alpha c_{ij}, \alpha s_i, s_j) = S(c_{ij}, s_i, s_j)$.
Property 12 requires that, if all objects occur equally frequently, the similarity between two objects is proportional to the number of cooccurrences of the objects.
Definition 6. A probabilistic similarity measure is defined as a direct similarity measure $S(c_{ij}, s_i, s_j)$ that has Properties 12 and 13.
The number of cooccurrences of two objects can be seen as the result of two independent effects. We refer to these effects as the similarity effect and the size effect. The similarity effect is the effect that, other things being equal, more similar objects have more cooccurrences. The size effect is the effect that, other things being equal, an object that occurs more frequently has more cooccurrences with other objects.
If one is interested in the similarity between two objects, the number of cooccurrences of the objects is in general not an appropriate measure. This is because, due to the size effect, the number of cooccurrences is likely to give a distorted picture of the similarity between the objects (see also Waltman & Van Eck, 2007).
Usually, when a direct similarity measure is applied to cooccurrence data, the aim is to correct the data for the size effect.
This means that the increase in the number of cooccurrences of i with each other object is completely due to the size effect and has not been caused by the similarity effect. Taking this into account, it is natural to expect that the similarities between i and the other objects remain unchanged. Property 13 implements this idea.
Let $p_i$ denote the probability that object $i$ occurs in a randomly chosen document. It is clear that $p_i = s_i/m$. If two objects $i$ and $j$ occur independently of each other, the probability that they cooccur in a randomly chosen document equals $p_{ij} = p_i p_j$. The expected number of cooccurrences of $i$ and $j$ then equals $e_{ij} = m p_{ij} = m p_i p_j = s_i s_j/m$. A natural way to measure the similarity between $i$ and $j$ is to calculate the ratio between on the one hand the observed number of cooccurrences of $i$ and $j$ and on the other hand the expected number of cooccurrences of $i$ and $j$ under the assumption that $i$ and $j$ occur independently of each other (for a similar argument in a more general context, see De Solla Price, 1981). This results in a measure that equals $c_{ij}/e_{ij}$. This measure has a straightforward probabilistic interpretation. If $c_{ij}/e_{ij} > 1$, $i$ and $j$ cooccur more frequently than would be expected by chance. If, on the other hand, $c_{ij}/e_{ij} < 1$, $i$ and $j$ cooccur less frequently than would be expected by chance. It is easy to see that $c_{ij}/e_{ij} = m\,S_A(c_{ij}, s_i, s_j)$. Hence, the measure $c_{ij}/e_{ij}$ is proportional to the association strength and, consequently, belongs to the class of probabilistic similarity measures.
When looking in more detail at the scatter plots in Figures 1 and 2, it can be seen that the similarity measures that are most strongly related to each other are the cosine and the Jaccard index. The same observation can be made in Tables 4, 5, and 6. The relatively strong relation between the cosine and the Jaccard index has been observed before and is discussed by Egghe (2009), Hamers et al. (1989), and Leydesdorff (2008).
As pointed out by Schneider and Borlund (2007a), from a statistical perspective, the use of an indirect similarity measure is a quite unconventional approach. However, despite being unconventional, we do not believe that the approach has any fundamental statistical problems. Appropriate indirect similarity measures include the Bhattacharyya distance, the cosine, and the Jensen-Shannon distance. These measures are known to have good theoretical properties (Van Eck & Waltman, 2008).
A very popular indirect similarity measure, especially for author cocitation analysis (e.g., McCain, 1990; White & Griffith, 1981; White & McCain, 1998), is the Pearson correlation. However, this measure does not have good theoretical properties and should therefore not be used (Ahlgren et al., 2003; Van Eck & Waltman, 2008).
The chi-squared distance, which is proposed as an indirect similarity measure by Ahlgren et al. (2003), also does not have all the theoretical properties that we believe an appropriate indirect similarity measure should have (Van Eck & Waltman, 2008).
In general, we believe the notion of direct similarity to be closer to the intuitive idea of similarity. Consider two objects that do not cooccur at all but that have quite similar cooccurrence profiles. The direct similarity between the objects will be very low, while the indirect similarity between the objects will be quite high. However, a high similarity between two objects that do not cooccur can be rather counterintuitive, at least in certain contexts. For that reason, we believe that in general the notion of direct similarity is more natural than the notion of indirect similarity.
Compared with direct similarity measures, indirect similarity measures are calculated based on a larger amount of data and most likely they therefore involve less statistical uncertainty.
Direct similarity measures determine the similarity between two objects by taking the number of cooccurrences of the objects and adjusting this number for the total number of occurrences of each of the objects. In scientometric research, when a direct similarity measure is applied to cooccurrence data, the aim usually is to normalize the data, that is, to correct the data for differences in the number of occurrences of objects.
We argue that cooccurrence data should always be normalized using a probabilistic similarity measure. Other direct similarity measures are not appropriate for normalization purposes. In particular, set-theoretic similarity measures should not be used to normalize cooccurrence data.
As we discussed earlier in this article, probabilistic similarity measures correct for the size effect. This follows from Property 13. Set-theoretic similarity measures do not have this property, and they therefore do not properly correct for the size effect.
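A small numeric illustration of this point, using the cosine ($c_{ij}/\sqrt{s_i s_j}$) and the association strength (taken, as before, to be proportional to $m\,c_{ij}/(s_i s_j)$): doubling object $i$'s occurrences together with all of its cooccurrences, a pure size effect, leaves the association strength unchanged (Property 13) but inflates the cosine by a factor of $\sqrt{2}$.

```python
# Property 13 in action: a pure size effect (object i occurs twice as often,
# and its cooccurrences double with it) should not change the similarity.
from math import sqrt

m, c, si, sj = 1000, 12, 40, 90

assoc = lambda c, si, sj: m * c / (si * sj)    # probabilistic measure
cosine = lambda c, si, sj: c / sqrt(si * sj)   # set-theoretic measure

print(assoc(c, si, sj), assoc(2 * c, 2 * si, sj))    # 3.33  3.33  -> unchanged
print(cosine(c, si, sj), cosine(2 * c, 2 * si, sj))  # 0.20  0.28  -> inflated by sqrt(2)
```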
As a consequence, set-theoretic similarity measures have, on average, higher values for objects that occur more frequently (see also Luukkonen et al., 1993; Zitt et al., 2000). The values of probabilistic similarity measures, on the other hand, do not depend on how frequently objects occur.
On the one hand, there are set-theoretic similarity measures, which can be interpreted as measures of the relative overlap of two sets. On the other hand, there are probabilistic similarity measures, which can be interpreted as measures of the deviation of observed cooccurrence frequencies from expected cooccurrence frequencies under an independence assumption.
In the left panel of the figure, similarities are determined using a probabilistic similarity measure, namely the association strength. In this panel, there is no substantial correlation between the number of occurrences of a term and the average similarity of a term (r=−0.069, ρ=−0.029). This is very different in the right panel, in which similarities are determined using a set-theoretic similarity measure, namely the cosine. (The inclusion index and the Jaccard index yield similar results.) In the right panel, there is a strong positive correlation between the number of occurrences of a term and the average similarity of a term (r =0.839, ρ=0.882). Results such as those shown in the right panel clearly indicate that set-theoretic similarity measures do not properly correct for the size effect and, consequently, do not properly normalize cooccurrence data.
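For readers who want to reproduce this kind of check on their own data, a minimal sketch is given below. It assumes a symmetric cooccurrence matrix C with a zero diagonal, a vector s of occurrence counts, and the number of documents m; the function and variable names are hypothetical, and SciPy is used for the Pearson and Spearman correlations.

```python
# Hedged sketch: correlate each object's number of occurrences with its average
# similarity to the other objects, for a given similarity measure.
# Assumptions: C is a symmetric cooccurrence matrix with a zero diagonal,
# s is the vector of occurrence counts, and m is the number of documents.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def average_similarities(C, s, m, measure):
    """Average similarity of each object with all other objects."""
    n = len(s)
    sims = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                sims[i, j] = measure(C[i, j], s[i], s[j], m)
    return sims.sum(axis=1) / (n - 1)

cosine = lambda c, si, sj, m: c / np.sqrt(si * sj)
association_strength = lambda c, si, sj, m: m * c / (si * sj)

def size_effect_check(C, s, m, measure):
    """Pearson r and Spearman rho between occurrence counts and average similarity;
    values close to 0 suggest the measure properly corrects for the size effect."""
    avg = average_similarities(C, s, m, measure)
    return pearsonr(s, avg)[0], spearmanr(s, avg)[0]
```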