Thijs, B., Zhang, L., & Glänzel, W. (2015). Bibliographic coupling and hierarchical clustering for the validation and improvement of subject-classification schemes. Scientometrics, 105(3), 1453-1467.
Thijs, B., Zhang, L., & Glänzel, W. (2013, January). Bibliographic coupling and hierarchical clustering for the validation and improvement of subject-classification schemes. In Proceedings of ISSI (pp. 237-249).
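This study uses bibliographic coupling to cluster the journals indexed in the Web of Science database. Second-order similarities are used to remedy the excessive sparseness of the similarity matrix that bibliographic coupling produces. The dendrogram and the silhouette statistics are then used to decide how many clusters to take from Ward's agglomeration method. Finally, the clustering results are compared with the journal-based subject-classification scheme proposed by Glänzel & Schubert (2003), both to see how the two correspond and to decide on cluster labels.
The advantage of bibliographic coupling is that all the data needed are already present in the paper or in the database, so links between papers (called publications in the article) and between journals can be calculated without delay, and a link, once established, stays constant over time. However, as with other citation-based methods, related papers or journals do not share all of their references, so a large number of zeros inevitably appear in the similarity matrix (Janssens, 2007; Janssens et al., 2008). This produces a very large number of singletons and degrades the quality of the subsequent cluster analysis. One earlier remedy is to combine citation-based similarities with lexical similarities, as in Janssens et al. (2008); another is to build the similarity matrix from second-order similarities, as in Janssens (2007), Ahlgren & Colliander (2009) and Thijs et al. (2013).
This study applies second-order similarities to cluster the journals in the Web of Science database. The journals analysed are those that published 100 or more papers between 2006 and 2009, 8282 in total. The procedure is as follows:
1. Salton's cosine measure is used to produce first-order similarities, and the cosine measure is then applied again to the first-order similarity matrix to produce second-order similarities. After the second-order similarities were calculated, 10 journals had no link to any other journal and were removed, leaving 8272 journals in the network.
2. Ward's agglomeration method is used to produce a hierarchical clustering, and the dendrogram and the silhouette statistics are used to infer plausible numbers of clusters; in this study the solutions have, from top to bottom, 6, 14 and 24 clusters. The silhouette method calculates for each journal a value between -1 and 1, where a positive value indicates that the journal has been assigned to an appropriate cluster. Journals are then grouped by cluster and sorted by silhouette value, and the resulting graph shows the quality of each cluster. A larger area on the positive side, that is, more journals with an appropriate assignment, indicates a better partitioning.
To check the quality of the clustering, the results are compared with the journal-based subject-classification scheme of Glänzel & Schubert (2003), and representative core journals are identified in each cluster to characterise it. Figure 5 shows the relations between the clusters as a network. In the figure, Arts and Humanities lies far from the other clusters; Neurosciences & Behaviour lies between the social sciences and the life sciences; Chemistry sits between Biosciences, medicine and physics; and, most notably, General, Regional and Community Issues is strongly linked with the life sciences.
In addition, the Jaccard index between the 14 clusters and the 15 fields of Glänzel & Schubert (2003) (excluding journals under Multidiscipline) is calculated and presented in Table 3.
Besides the comparison with the journal scheme of Glänzel & Schubert (2003), the study also compares the clustering results with the categories of the ESI (Essential Science Indicators), but finds that the ESI partitioning is not consistent with the structure of the clustering results.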
An attempt is made to apply bibliographic coupling to journal clustering of the complete Web
of Science database. Since the sparseness of the underlying similarity matrix proved
inappropriate for this exercise, second-order similarities have been used.
Cluster labelling was based on the roughly 70 subfields of the Leuven-Budapest
subject-classification scheme, which also allowed a comparison with the existing
two-level journal classification system developed in Leuven. A further comparison with the
22-field classification system of the Essential Science Indicators does, however, reveal larger
deviations.
The issue of subject classification and the creation of coherent journal sets has been a major
topic in our field since the seventies (see e.g., Narin et al., 1972; Narin, 1976).
The development of computerised methods and the availability of large datasets have shifted the
attention from mapping small or single disciplines to the generation of global science maps
(Garfield, 1998).
Jarneving (2005) applied bibliographic coupling to map and to analyse the structure of an annual
volume of the Science Citation Index.
Janssens et al. (2008; 2009) used a combination of
cross-citations and a lexical approach to map journals. Zhang et al. (2010) validated this
approach.
The advantage of bibliographic coupling is that there is no delay in calculating the link
between publications or journals, as all data needed are present upon publication or indexing
in the database. This also means that a link between documents, once established, will remain
constant over time.
The disadvantage of the method results from the very sparse nature of the link matrix (Janssens,
2007; Janssens et al., 2008). The overwhelming majority of document pairs do not share any
reference at all, and thus a large number of zeros occur in the similarity matrix. This
deteriorates the quality of the subsequent clustering and may result in an unrealistically large
number of singletons (cf. Jarneving, 2005).
As cross-citation data suffers from the same
problem, Janssens et al. (2008) introduced a hybrid approach, where they combined citation-based
with lexical similarities.
Another solution to overcome the sparseness problem is the use of second order similarities
(Janssens, 2007; Ahlgren & Colliander, 2009; Thijs et al., 2013).
A set of journals was compiled from the Web of Science database (SCI-Expanded, SSCI and
AHCI). All journals covered in this database between 2006 and 2009 with at least 100
publications in this period are taken into account. This resulted in a set of 8282 journals.
To express the strength of a link between two journals we calculated a first order similarity
based on Salton’s cosine measure. The mathematical derivation and interpretation of this
similarity measure in the framework of a Boolean vector space model can be found in (Sen &
Gan, 1983; Glänzel & Czerwon, 1996).
As bibliographic coupling tends to produce very
sparse similarity matrices, we applied a second-order similarity to reduce this effect. While the
first-order similarity is based on the angle between two reference vectors, the second-order
similarity is calculated as the cosine of the angle between two vectors, each holding the
first-order similarities of one of the two journals with the journals in the set.
After the calculation of the second-order similarities, ten
journals were removed from the set as they appeared to be singletons without any link to the
other journals in the set. The network thus included 8272 journals in total.
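
The two-step calculation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `cosine` and `similarity_matrix` and the toy reference vectors are hypothetical, and it assumes that each journal's first-order vector includes its similarity to itself.

```python
import math

def cosine(u, v):
    """Salton's cosine between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(vectors):
    """Pairwise cosine similarities between all row vectors."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

# Hypothetical toy data: rows are journals, columns are cited references.
refs = [
    [1, 1, 0, 0],  # journal A
    [0, 1, 1, 0],  # journal B
    [0, 0, 1, 1],  # journal C
]

first = similarity_matrix(refs)    # first order: cosine of reference vectors
second = similarity_matrix(first)  # second order: cosine of first-order rows
```

In this toy case journals A and C share no reference, so their first-order similarity is zero, but both are coupled with B, so their second-order similarity is positive. This is the sense in which the second step reduces the sparseness of the matrix.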
Hierarchical clustering with Ward’s agglomeration method was used to create a hard
clustering of all the journals.
This method does not provide any
automated optimum number of clusters so that the decision was made on the basis of the
dendrogram and the silhouette statistics (Rousseeuw, 1987).
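
As a rough illustration of how Ward's method builds a hard clustering, the sketch below merges items bottom-up and stops at a requested number of clusters. It is an assumption-laden toy, not the procedure used in the paper: it treats each item as a point in Euclidean space, uses the Lance-Williams update on squared distances, and the name `ward_clusters` and the sample points are hypothetical. It expects `1 <= k <= len(points)`.

```python
def ward_clusters(points, k):
    """Merge points bottom-up with Ward's criterion until k clusters remain."""
    n = len(points)
    clusters = {i: [i] for i in range(n)}
    # Squared Euclidean distances between all pairs of singleton clusters,
    # keyed as (smaller id, larger id).
    dist = {
        (i, j): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i in range(n) for j in range(i + 1, n)
    }
    next_id = n
    while len(clusters) > k:
        i, j = min(dist, key=dist.get)  # cheapest merge under Ward's criterion
        d_ij = dist[(i, j)]
        n_i, n_j = len(clusters[i]), len(clusters[j])
        for m, members in clusters.items():
            if m in (i, j):
                continue
            n_m = len(members)
            d_im = dist[(min(i, m), max(i, m))]
            d_jm = dist[(min(j, m), max(j, m))]
            # Lance-Williams update for Ward's method
            dist[(m, next_id)] = (
                (n_i + n_m) * d_im + (n_j + n_m) * d_jm - n_m * d_ij
            ) / (n_i + n_j + n_m)
        # Drop every distance that involves one of the two merged clusters.
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    labels = [0] * n
    for label, members in enumerate(clusters.values()):
        for idx in members:
            labels[idx] = label
    return labels

# Hypothetical toy data: two tight pairs and one isolated point.
points = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
labels = ward_clusters(points, 3)
```

The function only returns the partition for one chosen `k`; as the text notes, the method itself does not say which `k` is best, so that choice still has to come from the dendrogram or the silhouette statistics.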
Three different levels were chosen. The dendrogram
holds strong arguments for a six cluster partitioning while the silhouette plot shows a first
peak at 7 clusters. For the highest hierarchical level in the following analysis we use the six
cluster solution. At a lower level, the silhouette plot suggests the solutions with 14 and 24
clusters, respectively.
For the evaluation of the specific cluster solution we can rely on the silhouette graphs
presented in Figure 4. Each graph presents the silhouette values of the journals in the
respective cluster. For each journal a silhouette value is calculated. These values range
between 1 and -1 where positive values indicate an appropriate clustering of the journals.
Journals are grouped by cluster and ordered from highest silhouette value to lowest. As a
consequence the graph gives a good profile of the quality of each cluster. A larger area at the
positive side of the vertical axis thus represents a better partitioning.
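
The silhouette value of Rousseeuw (1987) for an item is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its smallest mean distance to any other cluster. The following sketch computes these values from a distance matrix. The function name `silhouette_values` and the toy data are hypothetical, and the sketch assumes at least two clusters and a symmetric distance matrix.

```python
def silhouette_values(dist, labels):
    """Silhouette value per item, given a distance matrix and cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    values = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # Rousseeuw's convention for singleton clusters
            continue
        # a: mean distance to the other members of the item's own cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom else 0.0)
    return values

# Hypothetical toy data: four items on a line at positions 0, 1, 10 and 11.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]

good = silhouette_values(dist, [0, 0, 1, 1])  # neighbours grouped together
bad = silhouette_values(dist, [0, 1, 0, 1])   # neighbours split apart
mean_good = sum(good) / len(good)
```

With the sensible grouping every value is positive, while the scrambled grouping gives negative values throughout. Sorting the values within each cluster and plotting them gives the kind of profile described for Figure 4.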
In order to find an acceptable solution, we decided to use the journal-based subject-classification
scheme developed in Leuven (Glänzel & Schubert, 2003). This solution proved
most advantageous since both clustering and classification scheme are based on journal
assignment. Table 1 presents the hierarchical structure of the three level partitioning. For each
cluster the number of journals is mentioned. The labels for the higher levels can be deduced
from the lowest level. These labels are taken from the Leuven classification system. The label
from the most prominent subject category has been assigned to the corresponding cluster.
Another way to describe the cluster is by using core journals. This notion can be defined
analogously to core documents, introduced by Glänzel & Czerwon (1996) and extended by Glänzel
& Thijs (2011).
In this particular application, a core journal can be identified as a journal with
at least n links with other journals of at least a given strength r on the second-order similarity
measure. For the identification of core journals in each cluster we set the number of strong
links to at least half the set of journals in the cluster.
As we are using second order similarities
this choice is not unreasonable. The value of the strength is chosen such that 12 journals
within each cluster comply with both criteria. This means that for more dense clusters the
choice of appropriate r-value is higher than in clusters where the journals are not as strongly
linked.
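
One possible reading of this selection rule is sketched below in Python. The function name `core_journals`, the rounding of "half the set" with `ceil`, and the toy similarity matrix are assumptions; the paper selects 12 journals per cluster, whereas the toy example asks for 2.

```python
import math

def core_journals(sim, n_core):
    """Return (r, core) for one cluster.

    sim is the within-cluster second-order similarity matrix. A journal is
    core if it has at least n links of strength >= r, where n is half the
    cluster size and r is set so that n_core journals qualify.
    """
    size = len(sim)
    n_links = math.ceil(size / 2)  # assumed rounding of "half the set"
    # For each journal, the strength of its n-th strongest link to the others:
    # it meets the link criterion exactly when this value is at least r.
    nth_strongest = []
    for i in range(size):
        links = sorted((sim[i][j] for j in range(size) if j != i), reverse=True)
        nth_strongest.append(links[n_links - 1])
    # Largest r for which at least n_core journals comply (ties may add more).
    r = sorted(nth_strongest, reverse=True)[n_core - 1]
    core = [i for i, s in enumerate(nth_strongest) if s >= r]
    return r, core

# Hypothetical second-order similarities within a four-journal cluster.
sim = [
    [1.0, 0.9, 0.8, 0.2],
    [0.9, 1.0, 0.7, 0.75],
    [0.8, 0.7, 1.0, 0.1],
    [0.2, 0.75, 0.1, 1.0],
]
r, core = core_journals(sim, n_core=2)
```

Because r is derived from the similarities inside each cluster, a densely linked cluster ends up with a higher r than a loosely linked one, which matches the remark above.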
Above all, chemistry is at
each level a separate cluster. One might expect that at the highest level, chemistry is merged
with Physics but we found different patterns.
The second noteworthy observation concerns
cluster 17 (Public Health & Nursing). This is a cluster within the ‘Psychology –
Neuroscience’ cluster at the highest, six-cluster level. In other partitions or subject
classification systems this is attributed to Non-Internal Medicine.
To visualise relations between the 24 clusters we created an additional map. Figure 5 shows
these relations.
Despite these multiple assignments we
used the Jaccard Index to measure the concordance between the two journal classifications.
The results are presented in Table 3.
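
The Jaccard Index of two sets is the size of their intersection divided by the size of their union. A minimal sketch of how such a concordance table could be computed follows; the journal identifiers, cluster names and field names are hypothetical and do not come from the paper.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A and B| / |A or B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical journal sets: clusters from the clustering, fields from a scheme.
clusters = {
    "cluster 1": {"J1", "J2", "J3"},
    "cluster 2": {"J4", "J5"},
}
fields = {
    "field X": {"J2", "J3", "J4"},
    "field Y": {"J5"},
}

# One Jaccard value per (cluster, field) pair, as in a concordance table.
table = {
    (c, f): jaccard(c_set, f_set)
    for c, c_set in clusters.items()
    for f, f_set in fields.items()
}
```

A value of 1 means a cluster and a field contain exactly the same journals, and 0 means they share none. A journal assigned to several fields simply appears in several field sets.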
Arts and Humanities is an outlier, Neurosciences & Behaviour acts as a bridge between Social Sciences and Life Sciences, and Chemistry takes a central position between Biosciences, Medical Sciences and Physics. The most striking observation in the map is the position of General, Regional and Community Issues, which is strongly linked with the Life Science fields.
The 24-cluster solution can be compared with the 22 categories from the classification of
Thomson Reuters’ Essential Science Indicators (ESI).
Janssens et al. (2009) showed very low mean silhouette values
for the ESI category system in spaces based on textual distances, cosine similarities
of cross-citation vectors, and combined distances, respectively.
In the present study, too, not all clusters
have a unique counterpart in the ESI classification system and vice versa (cf. Janssens et al.,
2009). Notably, the ESI fields Clinical Medicine, Engineering, Mathematics and Social
Sciences, General are almost uniformly spread over numerous clusters.
Based on this analysis we have to conclude that the segmentation of journals in the ESI categories is not supported by the structure found with bibliographic coupling between journals.
Given this rather weak association between the clustering based on cross-citations and the one based on bibliographic coupling, it is legitimate to ask which of the two methodologies performs better. A comparison of the mean silhouette values and of the silhouette values within each cluster reveals that the methodology presented in this paper results in a more consistent solution.
The 15-cluster cross-citation solution has a mean silhouette value of 0.04, while bibliographic coupling results in a value of 0.13.
The main
advantage of this method is that clustering can be carried out as soon as a new database volume is
available. The only issue is that cluster labels cannot be obtained directly from
the method. As a substitute, intellectual classification schemes can be used as a reference
system. Cluster labelling was based on the Leuven-Budapest subject-classification
scheme, which also allowed a comparison with the existing two-level journal classification
system developed in Leuven.
The further comparison with the 22-field classification system of the Essential Science
Indicators did, however, reveal some striking deviations. These concerned, above all, the
fields of clinical medicine, engineering, mathematics and the social sciences. New developments in computer science, neuroscience and psychology as well as in public health
(cf. Glänzel & Thijs, 2011) certainly contribute to this growing deviation.
The main objective of this study was to analyse whether the proposed methodology is
appropriate for multi-level journal clustering and to what extent the solutions fit in the
framework of traditional subject classification. Further comparison with other solutions such
as cross-citation and hybrid methods will be part of future research.