Zhang, L., Liu, X., Janssens, F., Liang, L., & Glänzel, W. (2010). Subject clustering analysis based on ISI category classification. Journal of Informetrics, 4(2), 185-193.
information visualization
This study investigates the information flow among ISI subject categories.
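Using journal-to-journal citation data, the study first measures, for each subject category, the entropy of its citation links, a self-link index, and the strength of its links to other subject categories. Entropy here measures whether a category's citation links are spread widely across many categories: the higher the entropy, the more evenly the category's references and citations are distributed over multiple categories. Nine of the ten categories with the highest entropy belong to the arts and humanities. Once all arts and humanities categories are removed, social sciences categories make up most of the top ten, joined by computer science, interdisciplinary applications; this agrees with Leydesdorff and Rafols (2009), who used betweenness centrality as their indicator of interdisciplinarity. The self-link index reflects a category's specialisation, and the three categories with the highest values are astronomy and astrophysics, mathematics, and law. The study also computes the link strength between every pair of categories, treats a value above 0.05 as a strong link, and counts the strong links of each category to show which categories each one cites, or is cited by, most often. Compared against the entropy values, social sciences categories spread their citations over many categories and therefore have higher entropy, whereas natural science categories have more concentrated citations and more readily show a few strong links. The study also uses these link strengths to build an ego-centred neighbour map for a category, showing its relationships with the categories most closely related to it.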
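Finally, the study clusters the subject categories with the Multi-level Aggregation Method (MAM) (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008) and compares the result with an earlier clustering that took journals as its objects (Zhang, Janssens, et al., 2009). MAM is a modularity-based community detection algorithm. In its first round, every node is assigned to its own cluster; the algorithm then tries merging clusters and computes the modularity after each merge, ending the round when modularity can no longer increase. At the start of the next round, the nodes merged into the same cluster are treated as a single node, each of these nodes is again assigned to its own cluster, and the merging procedure is repeated until all nodes have merged into one cluster or modularity can no longer increase (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008). The study treats the result of each round as a clustering at a different resolution and selects the resolution with the highest modularity. Zhang, Janssens, et al. (2009) clustered journals on the basis of journal-to-journal similarities and labelled the clusters with the best TF-IDF terms from the documents in those journals, whereas this study aggregates journal-to-journal citation data into citations between subject categories, clusters the categories, and labels each cluster according to the categories it contains. From the results, the study concludes that the two solutions differ in structure, and that the difference stems not only from the clustering approach but also, to a large degree, from the many journals assigned to more than one category. The clustering results can therefore help improve the subject classification of journals.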
The present study will focus on the analysis of the information flow among the ISI subject categories. This will be done for two important reasons. First, the exercise aims at finding an appropriate field structure of the Web of Science using the subject clustering algorithm developed in previous studies. Second, since ISI subject categories are based on journal assignment, the question arises of what changes if journal cross-citation is replaced by subject cross-citation. If the changes are not essential, the elaborate clustering of more than 8000 journals could be substituted by a somewhat easier analysis of roughly 250 ISI categories, and the journal level could, as it were, be skipped.
Boyack, Klavans, and Börner (2005) applied eight alternative measures of journal similarity to a dataset of 7121 journals covering over one million documents in the combined Science Citation and Social Sciences Citation Indexes, to show a global map of science using the force-directed graph layout tool VxOrd.
Chen (2008) proposes an approach to classify scientific networks in terms of aggregated journal-journal citation relations of the ISI Journal Citation Reports using the affinity propagation method.
As mentioned at the outset, Zhang, Glänzel, et al. (2009) and Zhang, Janssens, et al. (2009) have also investigated different methods for the analysis and classification of scientific journals.
Glänzel and Schubert (2003) designed a new classification scheme of science fields and subfields for scientometric evaluation purposes.
Moya-Anegon et al. (2004) proposed a new technique that uses thematic classification as entities of co-citation, and presented an ego-centred network of 222 ISI categories including science and social sciences.
Leydesdorff and Rafols (2009) classified the 172 ISI science categories into 14 groups based on factor analysis, and compared the interdisciplinarity of each category using betweenness centrality.
In contrast to these studies, we applied a new clustering technique to classify the ISI science and social sciences categories into 7 groups based on the category–category cross-citation similarities, and further compared the results with the 7-cluster hybrid clustering solution for 8305 journals from a previous study (Zhang, Janssens, et al., 2009).
The data have been collected from the Web of Science of Thomson-Reuters. Altogether 9487 journals which were assigned to the 246 categories of sciences, social sciences and arts and humanities in the entire period of 2002–2006 were selected and only three document types, namely, article, letter and review, were taken into consideration. More than six million papers were indexed and citations have been summed up through a variable citation window, from the publication year till 2006.
The clustering method adopted in this study is the Multi-level Aggregation Method (MAM) (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008), a new clustering algorithm based on modularity optimization.
Modularity (Newman, 2006) is a benefit function used in the analysis of networks or graphs such as computer networks or social networks. It quantifies the quality of a division of a network into modules or communities. Good divisions, having high modularity values, are those in which there are dense internal connections between the nodes within modules but only sparse connections between different modules.
The modularity of a division is defined as the fraction of the edges that fall within the given groups minus the expected fraction if the edges were distributed at random.
The value of the modularity lies in the range [−1,1]. It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
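The definition above can be checked on a toy network. The following is a minimal sketch, assuming an unweighted, undirected graph given as an edge list without self-loops; the function name `modularity` and the example graph (two triangles joined by one edge) are illustrative and not taken from the paper.

```python
def modularity(edges, group):
    """Fraction of edges inside groups minus the fraction expected at random.

    edges: list of (u, v) pairs of an undirected graph without self-loops.
    group: dict mapping each node to its group label.
    """
    m = len(edges)
    inside = {}   # number of edges with both ends in the group
    degree = {}   # sum of node degrees per group
    for u, v in edges:
        degree[group[u]] = degree.get(group[u], 0) + 1
        degree[group[v]] = degree.get(group[v], 0) + 1
        if group[u] == group[v]:
            inside[group[u]] = inside.get(group[u], 0) + 1
    # Expected fraction of edges inside a group under random wiring: (d_c / 2m)^2
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Two triangles (0-1-2 and 3-4-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

good = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}  # one group per triangle
bad = {0: "a", 1: "b", 2: "a", 3: "b", 4: "a", 5: "b"}   # groups cut across triangles

q_good = modularity(edges, good)
q_bad = modularity(edges, bad)
print(q_good, q_bad)
```

The division that follows the triangles gets a positive value, while the division that cuts across them gets a negative one, which is the behaviour the text describes.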
In MAM, each node of the network is first assigned to its own community.
The two nodes i and j are merged on the basis of the maximum modularity gain defined in Eq.(2).
The merging process is repeated until local maximum modularity is reached.
Then the current communities are employed to form super nodes to repeat the above merging process of nodes.
This process is applied repeatedly and sequentially for all nodes until no further improvement can be achieved.
When a local modularity maximum is reached during the optimization, it corresponds to a number of clusters, given by the communities formed at that point.
Since there are generally several local modularity maxima during the optimization stage, the cluster numbers at these maxima can be regarded as different clustering levels (resolutions).
Therefore, we can find the most appropriate number of clusters among these clustering levels, because the global modularity maximum can be found among the local modularity maxima.
Thus, the Multi-level Aggregation Method provides a heuristic scheme to determine the number of clusters automatically.
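The two-phase procedure described above can be sketched roughly as follows. This is a simplified illustration and not the authors' implementation: it assumes a small symmetric weight matrix stored as a list of lists, moves one node at a time into the neighbouring community with the largest modularity gain (the local step of Blondel et al.), and uses function names (`local_moving`, `aggregate`, `multilevel`) that are invented here.

```python
def modularity(W, comm):
    """Modularity of partition `comm` for a symmetric weight matrix W."""
    k = [sum(row) for row in W]
    two_m = sum(k)
    q = sum(W[i][j] - k[i] * k[j] / two_m
            for i in range(len(W)) for j in range(len(W))
            if comm[i] == comm[j])
    return q / two_m


def local_moving(W):
    """Phase 1: move nodes between communities until no move raises modularity."""
    n = len(W)
    k = [sum(row) for row in W]
    two_m = sum(k)
    comm = list(range(n))   # every node starts in its own community
    tot = k[:]              # total degree of each community
    improved = True
    while improved:
        improved = False
        for i in range(n):
            old = comm[i]
            tot[old] -= k[i]            # take node i out of its community
            links = {old: 0.0}          # weight from i to each neighbouring community
            for j in range(n):
                if j != i and W[i][j] > 0:
                    links[comm[j]] = links.get(comm[j], 0.0) + W[i][j]
            best, best_gain = old, links[old] - tot[old] * k[i] / two_m
            for c, w in links.items():
                gain = w - tot[c] * k[i] / two_m   # proportional to the modularity gain
                if gain > best_gain + 1e-12:
                    best, best_gain = c, gain
            comm[i] = best
            tot[best] += k[i]
            if best != old:
                improved = True
    return comm


def aggregate(W, comm):
    """Phase 2: collapse each community into one super node."""
    idx = {c: a for a, c in enumerate(sorted(set(comm)))}
    A = [[0.0] * len(idx) for _ in idx]
    for i, row in enumerate(W):
        for j, w in enumerate(row):
            A[idx[comm[i]]][idx[comm[j]]] += w
    return A, [idx[c] for c in comm]


def multilevel(W0):
    """Return one (membership, modularity) pair per aggregation level."""
    W, membership, levels = W0, list(range(len(W0))), []
    while True:
        W_new, comm = aggregate(W, local_moving(W))
        if len(W_new) == len(W):        # nothing merged: stop
            return levels
        membership = [comm[s] for s in membership]
        levels.append((membership[:], modularity(W0, membership)))
        W = W_new


# Toy network: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
W = [[0.0] * 6 for _ in range(6)]
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[u][v] = W[v][u] = 1.0

levels = multilevel(W)
best_membership, best_q = max(levels, key=lambda level: level[1])
print(best_membership, best_q)
```

Each entry of `levels` plays the role of one clustering level (resolution), and picking the entry with the highest modularity mirrors the selection rule in the text. On real data an established library implementation would be a safer choice than this dense-matrix sketch.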
The number of subject assignments in the Web of Science (SCIE, SSCI, AHCI) is 14,608 for 9487 journals during 2002–2006, that is, roughly 1.54 categories per journal. The average number of journals per category is 59.4.
Given the large share of journals assigned to multiple categories, as well as the large share of journal self-citations, aggregating journal citations to the category level will definitely affect or even distort the real network among categories. To avoid this latent distortion, we decided to exclude all journal self-citation data before building the aggregated category–category citation matrix. In other words, our category-to-category cross-citation matrix is aggregated only from citations between different journals.
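The aggregation step described above can be sketched as follows. The sketch assumes that a journal assigned to several categories contributes its full citation count to each of them; the excerpt does not say how multiple assignments were weighted, so this counting rule, the function name `aggregate_to_categories`, and the toy journals are all assumptions.

```python
from collections import defaultdict


def aggregate_to_categories(journal_citations, journal_categories):
    """Build a category-to-category citation matrix from journal-level counts.

    journal_citations: dict mapping (citing_journal, cited_journal) to a count.
    journal_categories: dict mapping each journal to its list of categories.
    Journal self-citations are dropped before aggregation.
    """
    matrix = defaultdict(int)
    for (citing, cited), count in journal_citations.items():
        if citing == cited:
            continue  # exclude journal self-citations
        for source in journal_categories[citing]:
            for target in journal_categories[cited]:
                # Assumption: full counting for multiply assigned journals.
                matrix[(source, target)] += count
    return dict(matrix)


# Hypothetical toy data: J2 is assigned to two categories.
journal_categories = {
    "J1": ["Physics"],
    "J2": ["Physics", "Chemistry"],
    "J3": ["Chemistry"],
}
journal_citations = {
    ("J1", "J2"): 10,
    ("J2", "J3"): 4,
    ("J1", "J1"): 50,  # self-citation, ignored
    ("J3", "J1"): 2,
}

category_matrix = aggregate_to_categories(journal_citations, journal_categories)
print(category_matrix)
```

Note that links within a category can still appear in the result (here Physics to Physics), because they come from citations between different journals that share a category.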
To measure how far references/citations are spread among other journals, Zhang, Glänzel, et al. (2009) introduced the indicator of entropy. Here we use the same indicator to measure the distribution of links among different categories.
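The excerpt does not reproduce the entropy formula of Zhang, Glänzel, et al. (2009). As an illustration only, the sketch below assumes the usual Shannon entropy over the shares of a category's links going to each other category; the function name `link_entropy` and the link counts are made up.

```python
import math


def link_entropy(link_counts):
    """Shannon entropy (natural log) of the distribution of a category's links.

    link_counts: number of links from one category to each other category.
    """
    total = sum(link_counts)
    if total == 0:
        return 0.0
    shares = [w / total for w in link_counts if w > 0]
    return -sum(p * math.log(p) for p in shares)


spread = link_entropy([25, 25, 25, 25])  # links spread evenly over four categories
focused = link_entropy([97, 1, 1, 1])    # links concentrated on one category
print(spread, focused)
```

Under this assumption, evenly spread links give the maximum value for the number of linked categories, while concentrated links give a value close to zero, which matches the reading of high entropy as a wide distribution of links.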
Table 1 shows the top 10 categories with highest entropies, where 9 of them are assigned to arts and humanities. This is not surprising as it is well-known that the arts and humanities tend to communicate with a large scope of different categories.
By contrast, we present the top 10 categories with the highest entropies after the exclusion of arts and humanities (see Table 2). Social sciences categories account for a large share; computer science, interdisciplinary applications has the highest entropy among the science categories. This result is in accordance with Leydesdorff and Rafols (2009), who concluded that computer science, interdisciplinary applications is the category with the highest interdisciplinarity among all science categories, although they used another indicator: betweenness centrality.
In contrast to the entropy, which measures the distribution of links within the communication network, the self-link index (SLI) mainly represents the degree of isolation (see Eq.(5)).
The 10 most isolated categories are listed in Table 3, where the top three are, respectively, astronomy and astrophysics, mathematics, and law. The striking SLI values may indicate a high degree of specialisation, or the particular citation characteristics of these categories.
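Eq.(5) is not reproduced in these notes, so the exact definition of the SLI is not available here. As a hedged illustration, the sketch below assumes the index is the share of a category's links that stay inside the category (the diagonal entry of the category-to-category matrix divided by its row total); the function name `self_link_index` and the matrix values are invented.

```python
def self_link_index(matrix, i):
    """Assumed SLI: share of category i's links that point back to category i."""
    row_total = sum(matrix[i])
    return matrix[i][i] / row_total if row_total else 0.0


# Hypothetical 3x3 category-to-category link matrix.
matrix = [
    [80, 15, 5],   # category 0 keeps most links to itself
    [10, 30, 60],
    [5, 45, 50],
]

sli = [self_link_index(matrix, i) for i in range(len(matrix))]
most_isolated_first = sorted(range(len(matrix)), key=lambda i: sli[i], reverse=True)
print(sli, most_isolated_first)
```

Ranking categories by this value in descending order gives a list of the most isolated categories in the spirit of Table 3, under the stated assumption about the formula.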
In the cross-citation network, categories either spread their information over, and/or collect information from, a wide variety of other categories regardless of link intensity (like the cases in Table 2), or have strong influence from or on a few particular categories but are relatively weak in expanding their communication scope (like the cases in Table 5). In general, social sciences categories are inclined to widen their link distributions, while science categories tend to have more intense links.
The categories shown in Table 5 could be considered as “central nodes” among the whole communication network. These “central” actors would form some coherent sub-clusters in the network, and act as “cores” in these clusters. It is worthwhile to have a look at those sub-clusters, where there are dense information communications.
There are indeed structural differences between the elaborate clustering of more than 8000 journals and the clustering of the ISI subject categories.
The former clustering results are generated automatically based on the journal-to-journal similarities, and are then labelled using the best TF-IDF terms from all documents under study in these individual journals; while in the clustering based on ISI subject categories, we first assign all the individual journals into different categories according to the ISI assignment and then aggregate all the journal-to-journal citation data to category-to-category citation data. The clustering is thus analyzed at the category level, and the labelling for each cluster is based on the names of ISI categories included.
Therefore, the two clustering results provide two subject classification schemes from different perspectives and at different levels. The two classifications are structurally comparable, but differences do exist. The divergence between the two structures may be due to interference from the multiple assignment of journals to ISI subject categories and, on the other hand, may also point to possible improvements of the ISI journal assignment scheme.