Tuesday, March 19, 2013

Klavans, R., & Boyack, K. W. (2006). Identifying a better measure of relatedness for mapping science. Journal of the American Society for Information Science and Technology, 57(2), 251-263.

information visualization


This study proposes an evaluation framework for relatedness measures and uses it to assess six intercitation and four cocitation measures, along with the results of feeding them into a visualization algorithm. The six intercitation measures are raw citation counts, the cosine index, the Jaccard index, the Pearson correlation coefficient, the relatedness factor (RF) for journal citation proposed by Pudovkin & Fuseler (1995) and Pudovkin & Garfield (2002), and the K50 index introduced in this study, defined as the cosine index minus an expected cosine value. The four cocitation measures are raw cocitation counts, the cosine index, the Pearson correlation coefficient, and the K50 index.

The proposed framework starts from a set of pre-classified objects and evaluates each measure on four criteria: accuracy, coverage, scalability, and robustness. For journal-level relatedness, for example, the ISI journal categories can serve as the classification baseline. Accuracy is the ability to judge correctly whether objects are related, and divides into local accuracy and global accuracy: local accuracy is the tendency of an object and its nearest objects to be correctly placed and ranked, that is, whether objects in the same category show higher relatedness than objects in different categories, while global accuracy concerns the placement and ranking of the categories themselves. Coverage is the proportion of the correct classification results obtainable at or above a given relatedness threshold out of all expected classification results. Scalability is whether a measure can be applied to very large data sets, which depends on its computational cost. Robustness is whether, after the relatedness values are passed through a visualization algorithm for dimension reduction, the relatedness implied by the distances between the mapped points still preserves the relationships of the original measure.

These four criteria interact: higher coverage usually comes with lower accuracy; a measure that needs heavy computation to be accurate cannot scale well; dimension reduction may also degrade accuracy; and finally, intercitation data as input draws on the most recent data and tends to give more accurate results, whereas cocitation data can include sources outside the analyzed journal set.

As a test case, the study uses journal intercitation and cocitation data from the 2000 ISI SCIE (Science Citation Index Expanded) and SSCI (Social Science Citation Index) files, covering 7121 journals and more than 16.24 million references between them, computes the ten relatedness measures above, and runs the visualization step with VxOrd. The results show that across coverage levels, the intercitation cosine (IC-Cosine) and the cosine-based K50 (IC-K50) predict the categories more accurately than the other measures, and are more practical than the Pearson correlation coefficient, which demands far more computation. Raw intercitation and cocitation counts both perform poorly under this category-based accuracy standard, and the intercitation measures are mostly more accurate than their cocitation counterparts. Most strikingly, after VxOrd visualization every measure attains higher accuracy than its original form.

The authors propose a new framework for assessing the performance of relatedness measures and visualization algorithms that contains four factors: accuracy, coverage, scalability, and robustness.

This method was applied to 10 measures of journal–journal relatedness to determine the best measure. The 10 relatedness measures were then used as inputs to a visualization algorithm to create an additional 10 measures of journal–journal relatedness based on the distances between pairs of journals in two-dimensional space. This second step determines robustness (i.e., which measure remains best after dimension reduction).

Results show that, for low coverage (under 50%), the Pearson correlation is the most accurate raw relatedness measure. However, the best overall measure, both at high coverage and after dimension reduction, is the cosine index or a modified cosine index. Results also showed that the visualization algorithm increased local accuracy for most measures.

The two main groups of measures are intercitation measures, or those based on one journal citing another, and cocitation measures, which are based on the number of times two journals are listed together in a set of reference lists.
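The two counting schemes can be contrasted on toy data — a minimal sketch, where the journal names (J1–J3) and reference lists are purely hypothetical:

```python
from collections import Counter
from itertools import combinations

# Toy data: each citing paper, with its source journal and the journals
# it cites (hypothetical, just to contrast the two counting schemes).
papers = [
    ("J1", ["J2", "J3"]),
    ("J1", ["J2"]),
    ("J2", ["J3", "J1"]),
]

intercitation = Counter()   # journal A citing journal B
cocitation = Counter()      # A and B cited together in one reference list
for citing, cited in papers:
    for c in cited:
        intercitation[(citing, c)] += 1
    for a, b in combinations(sorted(set(cited)), 2):
        cocitation[(a, b)] += 1

print(intercitation[("J1", "J2")])  # J1 cites J2 in two papers -> 2
print(cocitation[("J2", "J3")])     # J2 and J3 co-listed once -> 1
```

Intercitation is directional (who cites whom this year); cocitation only records that two journals appear together in a reference list, which is why it can reach sources outside the citing set.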

Although raw frequency has been used for both journal citation (Boyack, Wylie, & Davidson, 2002) and journal cocitation analysis studies in the past (McCain, 1991), it is rarely used today.

For intercitation studies, normalized frequencies such as the cosine, Jaccard, Dice, or Ochiai indexes (Bassecoulard & Zitt, 1999) are very simple to calculate, and give much better results than raw frequencies (Gmur, 2003).
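These indexes are indeed simple to compute. A minimal sketch of the usual set-overlap forms, assuming `f_ij` is the citation frequency between journals i and j and `s_i`, `s_j` their total citation counts (the paper's exact normalizations may differ slightly):

```python
import math

def similarity_indexes(f_ij, s_i, s_j):
    """Common normalized-frequency indexes for one journal pair.

    f_ij : citation frequency between journals i and j
    s_i, s_j : total citation counts of journals i and j
    These are the standard set-overlap forms; the Ochiai index
    coincides with the cosine for count data of this shape.
    """
    cosine = f_ij / math.sqrt(s_i * s_j)        # Salton/Ochiai cosine
    jaccard = f_ij / (s_i + s_j - f_ij)
    dice = 2 * f_ij / (s_i + s_j)
    return {"cosine": cosine, "jaccard": jaccard, "dice": dice}

idx = similarity_indexes(30, 100, 225)  # cosine = 30 / 150 = 0.2
```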

A new type of normalized frequency, specific to journals, has been proposed recently (Pudovkin & Fuseler, 1995; Pudovkin & Garfield, 2002). This new relatedness factor (RF), an intercitation measure, is unique in that it is designed to account for varying journal sizes, thus giving a more semantic or topic-oriented relatedness than other measures.

The Pearson correlation coefficient, known as Pearson’s r, is a commonly used measure for journal intercitation (Leydesdorff, 2004a, 2004b), journal cocitation (Ding, Chowdhury, & Foo, 2000; McCain, 1992, 1998; Morris & McCain, 1998; Tsay, Xu, & Wu, 2003), document cocitation (Chen, Cribbin, Macredie, & Morar, 2002; Gmur, 2003; Small, 1999; Small, Sweeney, & Greenlee, 1985), and author cocitation studies (cf. White, 2003; White & McCain, 1998).
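Unlike the pairwise indexes above, Pearson's r compares two journals' entire citation profiles (rows of the citation matrix), which is also why it costs more to compute. A self-contained sketch:

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length citation profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical profiles: counts of citations to the same set of journals.
r = pearson_r([12, 0, 3, 5], [10, 1, 4, 4])
```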

Lists of relatedness measurements are rarely analyzed directly, but are used as input to an algorithm that reduces the dimensionality of the data, and arranges the tokens on a 2-D plane. The distance between any two tokens on the 2-D plane is thus a secondary (or reduced) measure of relatedness.

Validation of relatedness measures has received little attention over the years. Most of these efforts have been to compare 2-D maps obtained from MDS with some sort of expert perceptions of the subject field.

Only one study has compared citation-based relatedness measures. Gmur (2003) compared six different relatedness measures based on the cocitation counts of 194 highly cited documents in the field of organization science. The measures included raw frequency, three forms of normalized frequency, Pearson’s r, and loadings from factor analysis. The bases for comparison were network-related metrics such as cluster numbers, sizes, densities, and differentiation. Results were strongly influenced by similarity type. For optimum definition of the different areas of research within a field, and their relationships, clustering based on Pearson’s r or on the combination of two types of normalized frequency worked best.

Accuracy refers to the ability of a relatedness measure to identify correctly whether tokens (e.g., journals, documents, authors, or words) are related.

Local accuracy refers to the tendency of the nearest tokens to be correctly placed or ranked. Ideally, local accuracy is measured from the perspective of each individual token. For authors, the question might be whether an author would agree with the ranking of the 10 most closely related authors. For journals, the question might be whether the closest journals were in the same discipline. For papers, the question might be whether the closest papers were on the same topic.
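One simple way to operationalize local accuracy — a stand-in for the paper's category-based test, with hypothetical tokens and category labels — is the average fraction of each token's k most-similar tokens that share its category:

```python
def local_accuracy(similarity, labels, k=10):
    """Mean fraction of each token's k nearest tokens in its own category.

    similarity : dict mapping (i, j) -> relatedness (stored one way round)
    labels     : dict mapping token -> category (e.g., ISI journal category)
    A simplified illustration, not the paper's exact metric.
    """
    tokens = list(labels)
    scores = []
    for t in tokens:
        ranked = sorted(
            (o for o in tokens if o != t),
            key=lambda o: similarity.get((t, o), similarity.get((o, t), 0.0)),
            reverse=True,
        )[:k]
        scores.append(sum(labels[o] == labels[t] for o in ranked) / len(ranked))
    return sum(scores) / len(scores)

# Toy check: two categories with strong within-category similarity.
sims = {("A1", "A2"): 0.9, ("B1", "B2"): 0.9,
        ("A1", "B1"): 0.1, ("A1", "B2"): 0.1,
        ("A2", "B1"): 0.1, ("A2", "B2"): 0.1}
cats = {"A1": "x", "A2": "x", "B1": "y", "B2": "y"}
score = local_accuracy(sims, cats, k=1)
```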

Global accuracy refers to the tendency for groups of tokens to be correctly placed or ranked, and requires that the tokens be clustered.

The assessment of accuracy requires some sort of independent data to use as a basis of comparison.

Coverage helps to assess the impact of thresholds on accuracy. In this analysis, thresholds are used to identify all relationships that are at or above a certain level of relatedness. Very high thresholds of relatedness will tend to identify the relationships between only a few tokens; lower thresholds will include more tokens, but the level of accuracy will likely be lower.
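An illustrative reading of coverage is the fraction of tokens that appear in at least one relationship at or above the threshold — a sketch of the notion, not the paper's exact formula:

```python
def coverage_at_threshold(pairs, threshold, n_tokens):
    """Fraction of tokens in at least one pair at/above the threshold.

    pairs : iterable of (i, j, relatedness) tuples
    Raising the threshold keeps only the strongest relationships,
    so fewer tokens are covered.
    """
    covered = set()
    for i, j, r in pairs:
        if r >= threshold:
            covered.update((i, j))
    return len(covered) / n_tokens

# Hypothetical pairs among 5 tokens: only 3 tokens survive the cutoff.
cov = coverage_at_threshold(
    [("a", "b", 0.9), ("b", "c", 0.5), ("d", "e", 0.2)], 0.5, 5)
```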

Scalability refers to the ability of a measure (or a derived measure from a visualization program) to be applied to extremely large databases.

Robustness refers to the ability of a measure to remain accurate when subjected to visualization algorithms. Visualization algorithms reduce the dimensionality of the data, and it is reasonable to assume that the reduction in dimensionality will affect the accuracy of the measure. While the visualizations allow a user to gain insights into the underlying structure of the data, these insights should be qualified by an assessment of the concurrent loss of accuracy.

One expectation is that greater coverage will result in lower accuracy.

Another expectation is that the measures that utilize more data and more calculations will be more accurate but less scalable.

A third expectation is that accuracy will drop when a measure is subjected to dimension-reduction techniques because the underlying data is inherently multidimensional.

The last tradeoff refers to the choice of intercitation versus cocitation measures. On the one hand, intercitation-based measures should be more accurate because the data are more current (current year to past years rather than past-year pairs). On the other hand, cocitation measures can cover far more sources.

The data used to calculate relatedness measures for this study were based on intercitation and cocitation frequencies obtained from the ISI annual file for the year 2000. Science Citation Index Expanded (SCIE; Thomson ISI, 2001a) and Social Science Citation Index (SSCI; Thomson ISI, 2001b) data files were merged, resulting in 1.058 million records from 7349 separate journals. Of the 7349 journals, we limited our analysis to the 7121 journals that appeared as both citing and cited journals. There were a total of 16.24 million references between pairs of the 7121 journals.

The resulting journal–journal citation frequency matrix was extremely sparse (98.6% of the matrix has zeros). While there was a great deal more cocitation frequency information, the journal–journal cocitation frequency matrix was also sparse (93.6% of the matrix has zeros).

The 10 relatedness measures used in this study are given below, along with their equations. The six intercitation measures are raw frequency, Cosine, Jaccard, Pearson’s r, the recently introduced average relatedness factor of Pudovkin and Garfield (2002), and a new normalized frequency measure that we introduce here, K50. ... Note that the new measure, K50, is simply the cosine index minus an expected cosine value. ... The four cocitation measures are raw frequency, cosine, Pearson’s r, and the cocitation version of the K50 measure.
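The paper's equations are elided in these notes, so here is only a hedged sketch of K50 as "cosine minus an expected cosine value", assuming the expected pair frequency under independence is s_i·s_j/total; the paper's exact expectation term may differ:

```python
import math

def k50(f_ij, s_i, s_j, total):
    """K50 sketch: cosine index minus an expected cosine value.

    The expectation assumes citations fall between i and j in
    proportion to their sizes (f_expected = s_i * s_j / total);
    this is an illustrative form, not the paper's exact equation.
    """
    cosine = f_ij / math.sqrt(s_i * s_j)
    expected_f = s_i * s_j / total
    expected_cosine = expected_f / math.sqrt(s_i * s_j)
    return cosine - expected_cosine

# With the same pair as before and a hypothetical grand total:
value = k50(30, 100, 225, 10_000)  # 0.2 - 0.015
```

Subtracting the expected value penalizes pairs whose co-occurrence is explained by sheer size rather than topical relatedness.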

As mentioned above, for each of the 10 relatedness measures, a dimension reduction was done using VxOrd. The process for calculating “re-estimated measures” is as follows. First, 2-D coordinates were calculated for each of the 7121 journals using VxOrd (cf. Figure 2). Next, the distances between each pair of journals (on the 2-D plane) were calculated for the entire set and used as the re-estimated measures of relatedness.
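The re-estimation step can be sketched generically: given 2-D coordinates from any layout algorithm (VxOrd in the paper; the coordinates below are hypothetical), pairwise distance becomes the secondary measure, with smaller distance meaning higher relatedness:

```python
import math

def reestimated_relatedness(coords):
    """Secondary relatedness from 2-D layout coordinates.

    coords : dict mapping journal -> (x, y)
    Returns negative Euclidean distance per pair, so that larger
    values mean more related, matching the raw measures' direction.
    """
    journals = list(coords)
    rel = {}
    for a in range(len(journals)):
        for b in range(a + 1, len(journals)):
            i, j = journals[a], journals[b]
            (x1, y1), (x2, y2) = coords[i], coords[j]
            rel[(i, j)] = -math.hypot(x1 - x2, y1 - y2)
    return rel

rel = reestimated_relatedness({"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (0.0, 1.0)})
```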

The IC-Pearson measure is the most accurate for higher absolute levels of relatedness (up to a rank of ~85,000). As ranked relatedness increases, the curves for all but the IC-Raw measure converge. The IC-Cosine, IC-K50, and IC-Jaccard measures generate nearly identical results over the entire relatedness range up to a rank of ~125,000.

The CC-Pearson measure is the best of the four up to a rank of ~350,000, and then drops below the CC-Cosine and CC-K50. The CC-K50 is slightly more accurate than the CC-Cosine, and the raw frequency measure, CC-Raw, gives the worst results by far.

Figure 4a shows that for the intercitation measures, the IC-Cosine and IC-K50 measures cover more journals than the other measures over the entire range of rank relatedness. The IC-Jaccard and IC-RFavg measures have the next highest coverage, followed by the IC-Pearson. The IC-Raw covers the fewest journals over most of the range.

The CC-Cosine and CC-K50 have the highest coverage, followed by the CC-Pearson. Once again, raw frequency gives the worst results.

The IC-Pearson measure is more accurate for up to a coverage of 0.58, while the IC-Cosine and IC-K50 are more accurate for coverage past 0.58. Note that, excepting the raw frequency measures, both of which do poorly, the intercitation measures are more accurate than the cocitation measures.

First, the IC-Cosine, IC-K50, and IC-Jaccard measures all have roughly comparable accuracy over the entire range of coverage. The IC-K50 measure is slightly more accurate than the others from 20–50% coverage, while the IC-Cosine is the most accurate from 50–90% coverage. The IC-Pearson measure remains below these three over the entire coverage range.

Second, the intercitation measures are more accurate than the cocitation measures in all cases.

Third, the Pearson measures are less accurate than the cosine measures for both the intercitation and cocitation data.

Also, note that the re-estimated K50 measures are essentially identical to the cosine measures for both the intercitation and cocitation data. Any differences at a particular coverage value are small enough to justify using the cosine value, which requires less calculation. It appears that, although the K50, by virtue of subtracting out the expected values, gives different individual similarity values and rankings, the aggregate effect on overall accuracy is minimal.

The most striking result comes from a comparison of the results of Figures 5 and 6, namely that the overall accuracy for all re-estimated measures is higher than for the raw measures over nearly the entire coverage range. This is an extremely counterintuitive finding, given the prevailing and common belief that information is lost when dimensionality is reduced.

Three of the intercitation measures (IC-Cosine, IC-K50, and IC-Jaccard) perform similarly, all with high accuracy values at both the 50% and 95% coverage levels.

All of the intercitation measures are limited to use within the citing journal set. If coverage outside the citing journal set is desired, cocitation measures can be used. Of these, the new measure introduced in this paper, CC-K50, is slightly better than the Cosine at high-coverage levels. Both the CC-Cosine and CC-K50 are clearly better than the Pearson correlation, both in terms of accuracy, and in that they do not require n² calculations, and thus scale to much larger sets than the Pearson.

First, we expected the Pearson correlation to provide the best results. The reason for this expectation is that the Pearson correlation uses more information in its construction (nearly the entire intercitation or cocitation matrix) than do the other measures. Pearson correlations allow for the influence of other parties. On the other hand, the other measures only use a small amount of the data in the matrix, and tend to limit their focus to the relationship between the two journals in question.

The second surprise was the increase in performance from the visualization software. We expected the performance to deteriorate due to the simple rule of thumb that reducing data to two dimensions requires tradeoffs that would result in lower accuracy.

The improvement in performance may be explained by the peculiarities of the VxOrd force directed algorithm. VxOrd balances attractive forces between nodes (the similarity values) with those of a repulsive grid that tries to force all nodes apart. It also cuts edges once the similarity-to-distance ratio falls below a threshold, and in most cases cuts about 50% of the original edges, thus leaving edges only where particularly strong similarities exist among a set of nodes. These dominant similarities are likely to be very accurate on the whole, and when concentrated by pruning the less accurate edges, may increase the overall accuracy of the solution.
