Thursday, April 9, 2015

Rafols, I., & Leydesdorff, L. (2009). Content‐based and algorithmic classifications of journals: Perspectives on the dynamics of scientific communication and indexer effects. Journal of the American Society for Information Science and Technology, 60(9), 1823-1835.

This study compares two content-based journal classifications with two algorithm-based journal classifications. The two content-based schemes are the ISI Subject Categories and the field/subfield classification of Glänzel and Schubert (2003) used by SOOI; the two algorithm-based schemes are the unfolding community-detection method of Blondel et al. (2008) and the random-walk matrix-decomposition method of Rosvall and Bergstrom (2008). Under the content-based schemes, a journal can be assigned to more than one category simultaneously; the algorithm-based schemes instead maximize the ratio of within-category citations to between-category citations. That is, the journal-to-journal citation data are arranged as a matrix and, after an appropriate permutation of rows and columns, the large values concentrate near the principal diagonal while the remaining entries are close to zero.
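The within/between-category citation ratio that the algorithmic schemes maximize can be sketched in a few lines of Python. This is a toy illustration, not either of the published algorithms; the matrix, the partition, and the function name are all invented for the example.

```python
# Toy sketch: count within- versus between-category citations for a given
# hard partition of journals. All data here are invented for illustration.

def within_between(matrix, labels):
    """Return (within, between) citation counts for a partition."""
    within = between = 0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if labels[i] == labels[j]:
                within += count
            else:
                between += count
    return within, between

# 4 journals: 0 and 1 cite each other heavily, as do 2 and 3.
M = [
    [0, 9, 1, 0],
    [8, 0, 0, 1],
    [1, 0, 0, 7],
    [0, 1, 6, 0],
]
labels = [0, 0, 1, 1]  # hypothetical algorithmic partition
within, between = within_between(M, labels)
print(within, between)  # most citations fall within categories
```

A good partition makes this ratio large; rearranging the rows and columns of `M` according to `labels` would concentrate the large entries near the principal diagonal.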

The statistics for the four classifications are summarized in Table 1:


Because the content-based schemes allow multiple assignments while the algorithm-based schemes aim at matrix decomposition, two phenomena can be observed in Table 1: 1) The median number of journals per category is higher for the two content-based classifications than for the algorithm-based ones, which can also be read from the distributions of journals per category shown in Figure 1 at the 0.50 level. Figure 1 further shows that all four classifications follow a log-normal distribution: in each of them, a relatively small number of categories contain a large number of journals, while many categories contain only a few. Moreover, the algorithm-based classifications are more skewed than the content-based ones, i.e., this concentration is even more pronounced. The top-10 categories of the random walk method contain 57% of the journals and those of the unfolding method 50%, compared with only 15% for ISI and 31% for SOOI.


2) Regarding the distribution of citations, the total citation counts of the two content-based classifications are higher than those of the algorithm-based ones, but the random walk and unfolding methods place a larger share of the citations within categories, whereas for ISI and SOOI most citations fall between categories.

Next, the similarity among the categories of each classification was compared using the cosine similarity of their citation patterns. The median values for ISI and SOOI are 0.020 and 0.066 respectively, much higher than the 0.009 and 0.007 of the random walk and unfolding methods. Again, the reason is the multiple-assignment property of the content-based schemes, which blurs the boundaries between categories, whereas the algorithm-based schemes cut more cleanly between them. The categories of each classification were then drawn as a network according to their similarities. All four maps show roughly the same two large groups: one of the biomedical sciences and the other of physics and engineering, connected through three bridging groups, namely chemistry, a geosciences-environment-ecology group, and computer science. The social sciences are somewhat detached on the maps, connected to biomedicine via the behavioral sciences/neuroscience and to physics/engineering via computer science and mathematics. In sum, the different science maps are similar, but they differ in the density of categories within groups.

In this study, we test the results of two recently available algorithms for the decomposition of large matrices against two content-based classifications of journals: the ISI Subject Categories and the field/subfield classification of Glänzel and Schubert (2003).

The content-based schemes allow for the attribution of more than a single category to a journal, whereas the algorithms maximize the ratio of within-category citations over between-category citations in the aggregated category-category citation matrix.

At that time, Leydesdorff & Rafols (2009) were deeply involved in testing the ISI Subject Categories of these same journals in terms of their disciplinary organization. Using the JCR of the Science Citation Index (SCI), we found 14 major components using 172 subject categories and 6,164 journals in 2006. Given our analytical objectives and the well-known differences in citation behaviour within the social sciences (Bensman, 2008), we decided to set aside the study of the (220 − 175 =) 45 subject categories in the social sciences for a future study.

Our findings using the SCI indicated that the ISI Subject Categories can be used for statistical mapping purposes at the global level despite being imprecise in terms of the detailed attribution of journals to the categories.

In this study, we compare the results of these two algorithms with the full set of 220 Subject Categories of the ISI. In addition to these three decompositions, a fourth classification system of journals was proposed by Glänzel and Schubert (2003) and increasingly used for evaluation purposes by the Steunpunt Onderwijs and Onderzoek Indicatoren (SOOI) in Leuven, Belgium. These authors originally proposed 12 fields and 60 subfields for the SCI, and three fields and seven subfields for the Social Science Citation Index and the Arts and Humanities Citation Index. Later, one more subfield entitled “multidisciplinary sciences” was added.

Thus, because research topics are, on the one hand, thinly spread outside the core group and, on the other hand, the core groups are interwoven, one cannot expect that the aggregated journal-journal citation matrix matches one-to-one with substantive definitions of categories or that it can be decomposed in a single and unique way in relation to scientific specialties. The choice of an appropriate journal set can be considered as a local optimization problem (Leydesdorff, 2006).

Citation relations among journals are dense in discipline-specific clusters and are otherwise very sparse, to the extent of being virtually non-existent (Leydesdorff & Cozzens, 2003).

The grand matrix of aggregated journal-journal citations is so heavily structured that the mappings and analyses in terms of citation distributions have been amazingly robust despite differences in methodologies (e.g., Leydesdorff, 1987 and 2007; Tijssen, de Leeuw, & van Raan, 1987; Boyack, Klavans, & Börner, 2005; Moya-Anegón et al., 2007; Klavans & Boyack, 2009).

A decomposable matrix is a square matrix such that a rearrangement of rows and columns leaves a set of square sub-matrices on the principal diagonal and zeros everywhere else.

In the case of a nearly decomposable matrix, some zeros are replaced by relatively small nonzero numbers (Simon & Ando, 1961; Ando & Fisher, 1963). Near-decomposability is a general property of complex and evolving systems (Simon, 1973 and 2002).
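A minimal numeric illustration of near-decomposability (all values invented): permuting the rows and columns of a nearly decomposable matrix uncovers dense square blocks on the principal diagonal, with only small values elsewhere.

```python
# Toy nearly decomposable matrix: indices 0,2 and 1,3 form two dense
# blocks, hidden by the original ordering.
M = [
    [5, 0, 4, 1],
    [0, 6, 1, 5],
    [4, 1, 5, 0],
    [1, 5, 0, 6],
]

perm = [0, 2, 1, 3]  # regroup the interleaved indices into two blocks
P = [[M[i][j] for j in perm] for i in perm]

for row in P:
    print(row)
# The permuted matrix has dense 2x2 blocks on the principal diagonal and
# only small (0 or 1) entries everywhere else.
```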

The decomposition into nearly decomposable matrices has no analytical solution. However, algorithms can provide heuristic decompositions when there is no single unique correct answer.

Newman (2006a, 2006b) proposed using modularity for the decomposition of nearly decomposable matrices since modularity can be maximized as an objective function.

Blondel et al. (2008) used this function for relocating units iteratively in neighbouring clusters. Each decomposition can then be considered in terms of whether it increases the modularity.
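The objective function involved can be sketched in plain Python. This is not Blondel et al.'s implementation, only an invented toy computation of the modularity Q of a partition, showing that relocating a node into the wrong cluster lowers Q.

```python
from collections import defaultdict

# Toy modularity computation for an undirected, unweighted graph.
def modularity(edges, labels):
    m = len(edges)
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # fraction of edges that fall within communities...
    q = sum(1.0 for u, v in edges if labels[u] == labels[v]) / m
    # ...minus the fraction expected if edges were placed at random
    dsum = defaultdict(int)
    for node, d in deg.items():
        dsum[labels[node]] += d
    for d in dsum.values():
        q -= (d / (2.0 * m)) ** 2
    return q

# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
good = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # the two triangles
bad = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 1}   # node 2 relocated wrongly
print(modularity(edges, good) > modularity(edges, bad))  # True
```

In the unfolding algorithm, moves such as the relocation of node 2 are accepted only when they increase Q.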

Analogously, Rosvall and Bergstrom (2008) maximized the probabilistic entropy between clusters by estimating the fraction of time during which every node is visited in a random walk (cf. Theil, 1972; Leydesdorff, 1991).
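The quantity being estimated, the fraction of time a random walker spends at each node, can be sketched by power iteration on the row-normalized citation matrix. This toy version omits the teleportation step of the published algorithm, and all values are invented.

```python
# Toy sketch: stationary visit frequencies of a random walk on a
# row-normalized citation matrix, approximated by power iteration.
def stationary(matrix, steps=200):
    n = len(matrix)
    # row-normalize the citation counts into transition probabilities
    P = [[c / float(sum(row)) for c in row] for row in matrix]
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# 3 journals: 0 and 1 exchange many citations; journal 2 is peripheral.
M = [
    [0, 9, 1],
    [8, 0, 1],
    [1, 1, 0],
]
pi = stationary(M)
print(pi)  # the walker spends most of its time at journals 0 and 1
```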

The data were harvested from the CD-Rom version of the JCR of the SCI and Social Science Citation Index 2006, and then combined. ... The resulting set of 7,611 journals and their citation relations otherwise precisely corresponds to the online version of the JCRs. This large data matrix of 7,611 times 7,611 citing and cited journals was stored conveniently as a Pajek (.net) file and used for further processing.

The 7,611 journals are attributed by the ISI with 11,856 subject classifiers. This is 1.56 (±0.76) classifiers per journal. The ISI staff assign the 220 ISI Subject Categories on the basis of a number of criteria including the journal's title and its citation patterns (McVeigh, personal communication, March 9, 2006; Bensman & Leydesdorff, 2009).

According to the evaluation of Pudovkin and Garfield (2002), in many fields these categories are sufficient, but the authors added that “in many areas of research these ‘classifications’ are crude and do not permit the user to quickly learn which journals are most closely related” (p. 1113).

Leydesdorff and Rafols (2009) found that the ISI Subject Categories can be used for statistical purposes—the factor analysis for example can remove the noise—but not for the detailed evaluation. In the case of interdisciplinary fields, problems of imprecise or potentially erroneous classifications can be expected.

For the purpose of developing a new classification scheme of scientific journals contained in the SCIs, Glänzel and Schubert (2003) used three successive steps for their attribution. The authors iteratively distinguished sets cognitively on the basis of expert judgements, pragmatically to retain multiple assignments within reasonable limits, and scientometrically using unambiguous core journals for the classification. The scheme of 15 fields and 68 subfields is used extensively for research evaluations by the Steunpunt Onderwijs and Onderzoek Indicatoren (SOOI), a research unit at the Catholic University in Leuven, Belgium, headed by Glänzel.

The SOOI categories cover 8,985 journals. Using the full titles of the journals, 7,485 could be matched with the 7,611 journals under study in the JCR data for 2006 (which is 98.3%). These journals are attributed 10,840 classifiers at the subfield level. This is 1.45 (±0.66) categories per journal. One category (“Philosophy and Religion”) is missing because the Arts & Humanities Citation Index is not included in our data. Thus, we pursued the analysis with the 67 SOOI categories.

Using Rosvall and Bergstrom's (2008) algorithm with 2006 data, we obtained findings similar to those of these authors on August 11, 2008. Like the original authors using 6,128 journals in 2004, we found 88 clusters using 7,611 journals in 2006.

Lambiotte, one of the coauthors of Blondel et al. (2008), was so kind as to input the data into the unfolding algorithm and found the following results: 114 communities with a modularity value of 0.527708 and 14 communities with a modularity value of 0.60345. We use the 114 communities for the purposes of this comparison. These categories refer to 7,607 (= 7611 − 4) journals because four of the journals in the file were isolates.

The number of journals per category is log-normally distributed in each of the four classifications. In other words, they all have a relatively small number of categories with a large number of journals and many categories with only a few journals. However, as shown in Figure 1, the classifications based on the random walk and unfolding algorithms are more skewed than the content-based classifications.



Whereas the top-10 categories on the basis of a random walk comprise 57% of the journals (50% for unfolding), they cover only 15% in the ISI decomposition and 31% for the SOOI classification. In the case of skewed distributions, the characteristic number of journals per category can best be expressed by the median: the median is below 30 in the random walk or unfolding classifications, compared with 42 journals for the ISI classification and 141 for the SOOI classification (Table 1).
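These two summary statistics, the top-10 share and the median category size, are straightforward to compute; the category sizes below are invented to mimic a skewed, roughly log-normal distribution.

```python
# Toy sketch of the two statistics reported above, on invented data.

def top_share(sizes, k=10):
    """Fraction of all journals covered by the k largest categories."""
    s = sorted(sizes, reverse=True)
    return sum(s[:k]) / float(sum(s))

def median(sizes):
    """Median category size (characteristic value for skewed data)."""
    s = sorted(sizes)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

# Hypothetical skewed distribution of journals per category.
sizes = [500, 300, 200, 120, 80, 60, 40, 30, 20, 15] + [10] * 20 + [5] * 20
print(top_share(sizes), median(sizes))
```

For skewed distributions like these, the median is far more informative than the mean, which is dominated by the few very large categories.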


As presented in the last rows of Table 1, the total numbers of citations in the aggregated matrices based on the ISI or SOOI classifications are much higher because the same citation can be attributed to two or three categories. Thus, whereas random walk and unfolding lead to matrices with most citations within categories (on the diagonal), matrices based on ISI and SOOI classifications lead to matrices with most citations between categories (off-diagonal).

Finally, to measure how similar the categories in the four decompositions are to each other, we computed the cosine similarities in the citation patterns between each pair of citing categories in the four aggregated category-category matrices (Salton & McGill, 1983; Ahlgren, Jarneving, & Rousseau, 2003).
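The cosine between the citation patterns of two categories, i.e., two rows of an aggregated category-category matrix, can be sketched as follows; the two vectors are invented toy data.

```python
import math

# Toy sketch: cosine similarity between the citation patterns of two
# categories (rows of a category-category citation matrix).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_a = [10, 5, 0, 1]  # citations given by category A to each category
cat_b = [8, 6, 1, 0]   # citations given by category B to each category
print(round(cosine(cat_a, cat_b), 3))
```

Two categories that cite the same other categories in similar proportions score near 1; categories with disjoint citation targets score near 0.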

We find again that all the distributions are highly skewed and that the random walk and unfolding algorithms exhibit a much lower median similarity value among categories. The lower medians indicate that the algorithmic decompositions produce a much “cleaner” cut between categories than the content-based classifications.
In conclusion, the analysis of the statistical properties of the different classifications teaches us that the random walk and the unfolding algorithms produce much more skewed distributions in terms of the number of journals per category, but these constructs are more specific than the content-based classification of the ISI and SOOI. The content-based sets are less divided because the boundaries among them are blurred by the multiple assignments.

In summary, although the correspondences among the main categories are sometimes as low as 50% of the journals, most of the mismatched journals appear to fall in areas within the close vicinity of the main categories. In other words, it seems that the various decompositions are roughly consistent but imprecise.

Maps of science for each decomposition were generated from the aggregated category-category citation matrices using the cosine as similarity measure.

The similarity matrices were visualized with Pajek (Batagelj & Mrvar, 1998) using Kamada and Kawai's (1989) algorithm.

The threshold value of similarity for edge visualization is pragmatically set at cosine > 0.01 for the algorithmic decompositions and cosine > 0.2 for the content-based decompositions to enhance the readability of the maps without affecting the representation of the structures in the data.
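Edge thresholding of this kind can be sketched as a simple filter over the similarity matrix; the matrix below is toy data, with the cutoffs 0.2 and 0.01 echoing the threshold values used above.

```python
# Toy sketch: keep only edges whose cosine similarity exceeds a threshold,
# as done before visualization. The similarity matrix is invented.

def visible_edges(sim, threshold):
    """Return the upper-triangle edges with similarity above threshold."""
    n = len(sim)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sim[i][j] > threshold]

sim = [
    [1.0, 0.30, 0.05],
    [0.30, 1.0, 0.01],
    [0.05, 0.01, 1.0],
]
print(visible_edges(sim, 0.2))   # only the strongest tie survives
print(visible_edges(sim, 0.01))  # the weak 0.05 tie reappears
```

Raising the threshold declutters the map; the content-based maps tolerate a higher cutoff because their multiple assignments produce many moderately similar category pairs.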

For the ISI decomposition, the 220 categories (Figure 3) were clustered into 18 macro-categories (Figure 4) obtained from the factor analysis (cf. Leydesdorff and Rafols, 2009).


The map of the SOOI classification was constructed with all its 67 subfields (Figure 5).


Taking advantage of the concentration of journals in a few categories, in the case of random walk and unfolding only the top 30 and 35 categories were used, respectively.


Indeed, the four maps correspond in displaying two main poles: a very large pole in the biomedical sciences and a second pole in the physical sciences and engineering. These two poles are connected via three bridging areas: chemistry, a geosciences-environment-ecology group, and the computer sciences. The social sciences are somewhat detached, linked via the behavioral sciences/neuroscience to the biomedical pole, and via the computer sciences and mathematics to the physics/engineering pole.

As noted above, although categories of different decompositions do not always match with one another, most “misplaced” journals are assigned into closely neighbouring categories. Therefore, the error in terms of categories is not large and is also unsystematic. The noise-to-signal ratio becomes much smaller when aggregated over the relations among categories.

As a second important observation that can be made on the basis of these maps, we wish to point to the differences in category density between the content-based and the algorithm-based maps.

In summary, we were surprised to find that the different science maps are similar except that they differ in the density of categories within groups.

The content-based classifications achieve a more balanced coverage of the disciplines at the expense of distinguishing categories that may be highly similar in terms of journals.

The first finding is that the algorithmic decompositions have very skewed and clean-cut distributions, with large clusters in a few scientific areas, whereas indexers maintain more even and overlapping distributions in the content-based classifications.

Second, the different classifications show a limited degree of agreement in terms of matching categories. In spite of this lack of agreement, however, the science maps obtained are surprisingly similar; this robustness is due to the fact that although categories do not match precisely, their relative positions in the network among the other categories is based on distributions that match sufficiently to produce corresponding maps at the aggregated level.
