Monday, July 11, 2016

Wang, Q., & Waltman, L. (2016). Large-scale analysis of the accuracy of the journal classification systems of Web of Science and Scopus. Journal of Informetrics, 10(2), 347-364.


This study uses citation data to compare the accuracy of the journal classification systems provided by the Web of Science and Scopus databases. A classification system can be applied to various problems; for instance, it can be used to demarcate research areas (Glänzel & Schubert, 2003; Waltman & Van Eck, 2012), to evaluate and compare the impact of research across fields (Leydesdorff & Bornmann, 2015; Van Eck, Waltman, Van Raan, Klautz, & Peul, 2013), and to study the interdisciplinarity of research (Porter & Rafols, 2009; Porter, Roessner, & Heberger, 2008). Besides Web of Science and Scopus, other journal classification systems include Science-Metrix, the NSF (National Science Foundation) system, the UCSD (University of California, San Diego) classification system, and the ANZSRC (Australian and New Zealand Standard Research Classification). In addition, Glänzel and Schubert (2003) proposed a hierarchical classification system covering both journals and publications. Algorithmically constructed journal classifications have been studied by Bassecoulard and Zitt (1999), Chen (2008), and Rafols and Leydesdorff (2009), among others, while the algorithm of Waltman and Van Eck (2012) classifies the individual publications appearing in journals.

According to the literature review by Waltman (2015, Section 3), previous comparisons of Web of Science and Scopus have mainly addressed the coverage of the databases (e.g., López-Illescas, De Moya-Anegón, & Moed, 2008; Meho & Rogers, 2008; Mongeon & Paul-Hus, 2016; Norris & Oppenheim, 2007) or the accuracy of the databases when used to assess research output and impact (e.g., Archambault, Campbell, Gingras, & Larivière, 2009; Bar-Ilan, Levene, & Lin, 2007; Meho & Rogers, 2008; Meho & Sugimoto, 2009). No study has compared and analyzed the accuracy of their classification systems.

Pudovkin and Garfield (2002) explained that WoS first assigned journals to categories using manual heuristics and afterwards used the citation-based Hayne-Coulson algorithm to classify new journals. In addition, Katz and Hicks (1995), Leydesdorff (2007), and Leydesdorff and Rafols (2009) have pointed out that the WoS classification system combines citation patterns, journal titles, and expert opinion. For Scopus, no literature describes how its classification system was constructed. In fact, WoS offers two classification systems: a system of about 250 categories and a system of about 150 research areas. A further classification system, ESI (Essential Science Indicators), covers only the sciences and social sciences. This study analyzes the WoS system of categories.

The Scopus journal classification system is called the ASJC (All Science Journal Classification). It has two levels: the bottom level contains 304 categories and the top level 27 categories.

Since journal classification systems are so useful, many studies have proposed ways to improve the WoS and Scopus systems. For example, Glänzel and colleagues have studied several approaches to validate and improve the WoS classification system (Janssens, Zhang, De Moor, & Glänzel, 2009; Thijs, Zhang, & Glänzel, 2015; Zhang, Janssens, Liang, & Glänzel, 2010), and López-Illescas, Noyons, Visser, De Moya-Anegón, & Moed (2009) proposed an improved field delineation based on the WoS classification system. For the Scopus classification system, improvements have come from the SCImago group (Gómez-Núñez, Vargas-Quesada, De Moya-Anegón, & Glänzel, 2011; Gómez-Núñez, Batagelj, Vargas-Quesada, De Moya-Anegón, & Chinchilla-Rodríguez, 2014; Gómez-Núñez, Vargas-Quesada, & De Moya-Anegón, 2016).

Approaches for assessing the accuracy of journal classification systems can be divided into expert-based and bibliometric approaches. The expert-based approach runs into serious difficulties at a large scale: no single expert has sufficient knowledge to assess the classification of journals in all scientific disciplines, so a large number of experts would have to be involved. The bibliometric approach can be further divided into text-based and citation-based methods, which respectively use the textual similarity and the citation-pattern similarity of publications in journals of the same category as the criterion for whether journals belong together. This study adopts direct citation relations. Klavans and Boyack (2015), who used direct citation relations in an algorithm for constructing publication-level classification systems, concluded that direct citation is more accurate than indirect citation relations such as bibliographic coupling or co-citation.

In summary, the rationale of the proposed method is that a journal should cite, or be cited by, the journals in its own category more frequently than the journals in other categories. Based on this principle, the study defines two criteria for checking whether a journal is assigned to appropriate categories:
Criterion I: If a journal has only very few citation relations with the other journals in its assigned category, its classification may be questionable.
Criterion II: If a journal has many citation relations with the journals of another category, it may have been classified incorrectly.

The analysis covers all journals in the WoS and Scopus databases from 2010 to 2014; the relevant statistics are shown in Table 1. Comparing the two databases, Scopus not only indexes more journals and more categories than WoS, but also tends to assign each journal to more categories: on average a journal is assigned to about 1.6 categories in WoS but 2.1 in Scopus.


According to Criterion I, both WoS and Scopus assign many journals to unsuitable categories, and the problem is especially severe in Scopus, as shown in Table 3. Among the categories containing at least 10 journals, those in which at least half of the journals satisfy Criterion I number 17 in WoS and as many as 76 in Scopus. Categories flagged in both databases include ARCHITECTURE, BIOPHYSICS, and MEDICAL LABORATORY TECHNOLOGY.



On Criterion II, both databases perform reasonably well (Table 6).


Journals satisfying both Criterion I and Criterion II have only weak connections with their assigned categories while having strong connections with categories to which they are not assigned. Analyzing these journals suggests two possibilities: either what the journals actually publish has drifted away from their titles and scope statements, or the journals were classified based on their titles alone.

The experimental results lead to the following conclusions:
1. On Criterion I, WoS performs better than Scopus; that is, Scopus journals more often have only weak connections with their assigned categories.
2. On Criterion II, both databases perform quite well; that is, if a journal is strongly connected to a category, WoS and Scopus usually assign it to that category.
3. Combining the two criteria, WoS generally performs much better than Scopus.

Beyond these conclusions, the study also points out that some Scopus categories have confusingly similar labels, for example the two categories named LINGUISTICS & LANGUAGE and LANGUAGE & LINGUISTICS, and the pair INFORMATION SYSTEMS & MANAGEMENT and MANAGEMENT INFORMATION SYSTEMS.

Moreover, both classification systems lack transparency: the authors found no proper documentation of how the systems are constructed and updated.


To examine and compare the accuracy of journal classification systems, we define two criteria on the basis of direct citation relations between journals and categories. We use Criterion I to select journals that have weak connections with their assigned categories, and we use Criterion II to identify journals that are not assigned to categories with which they have strong connections. If a journal satisfies either of the two criteria, we conclude that its assignment to categories may be questionable.

Accordingly, we identify all journals with questionable classifications in Web of Science and Scopus. Furthermore, we perform a more in-depth analysis for the field of Library and Information Science to assess whether our proposed criteria are appropriate and whether they yield meaningful results.

It turns out that according to our citation-based criteria Web of Science performs significantly better than Scopus in terms of the accuracy of its journal classification system.

Classifying journals into research areas is an essential subject for bibliometric studies.

A classification system can assist with various problems; for instance, it can be used to demarcate research areas (e.g., Glänzel & Schubert, 2003; Waltman & Van Eck, 2012), to evaluate and compare the impact of research across scientific fields (e.g., Leydesdorff and Bornmann, 2015; Van Eck, Waltman, Van Raan, Klautz, & Peul, 2013), and to study the interdisciplinarity of research (e.g., Porter & Rafols, 2009; Porter, Roessner, & Heberger, 2008).

Besides the WoS and Scopus classification systems, there are various other multidisciplinary classification systems, for instance the system of Science-Metrix, the system of the National Science Foundation (NSF) in the US, the UCSD classification system, and the system of the Australian and New Zealand Standard Research Classification (ANZSRC).

Science-Metrix assigns “individual journals to single, mutually exclusive categories via a hybrid approach combining algorithmic methods and expert judgment” (Archambault, Beauchesne, & Caruso, 2011, p. 66). The Science-Metrix system includes 176 categories.

The NSF system also offers a mutually exclusive classification of journals, but it is more aggregated, consisting of only 125 categories (Boyack & Klavans, 2014). The system is used in the Science & Engineering Indicators of the NSF.

A more detailed classification system is the so-called University of California, San Diego (UCSD) classification system. This system, which includes more than 500 categories, has been constructed in a largely algorithmic way. The construction of the UCSD classification system is discussed by Börner et al. (2012).

The ANZSRC’s Field of Research (FoR) classification system has a three-level hierarchical structure. Journals are classified at the top level and at the intermediate level. Journals can have multiple classifications.

Furthermore, Glänzel and Schubert (2003) designed a two-level hierarchical classification system, which can be applied at the levels of both journals and publications. They adopted a top-bottom strategy; specifically, they first defined categories on the basis of the experience of bibliometric studies and external experts. They then assigned journals and individual publications to the categories. This classification system has for instance been used for measuring interdisciplinarity. In their analysis of interdisciplinarity, Wang, Thijs, & Glänzel (2015) explain that instead of the WoS subject categories they use the more aggregated classification system developed by Glänzel and Schubert (2003).

Algorithmic approaches to construct classification systems at the level of journals have been studied by for instance Bassecoulard and Zitt (1999), Chen (2008), and Rafols and Leydesdorff (2009).

A more recent development is the algorithmic construction of classification systems at the level of individual publications rather than journals. Waltman and Van Eck (2012) developed a methodology for algorithmically constructing classification systems at the level of individual publications on the basis of citation relations between publications. Their approach has for instance been used in the calculation of field-normalized citation impact indicators (Ruiz-Castillo & Waltman, 2015).

According to a recent literature review (Waltman, 2015, Section 3), previous studies comparing WoS and Scopus are mainly focused on two aspects. One is the coverage of the databases (e.g., López-Illescas, De Moya-Anegón, & Moed, 2008; Meho & Rogers, 2008; Mongeon & Paul-Hus, 2016; Norris & Oppenheim, 2007) and the other is the accuracy of the databases when used to assess research output and impact at different levels, ranging from individual researchers to departments, institutes, and countries (e.g., Archambault, Campbell, Gingras, & Larivière, 2009; Bar-Ilan, Levene, & Lin, 2007; Meho & Rogers, 2008; Meho & Sugimoto, 2009). However, no study has systematically compared WoS and Scopus in terms of the accuracy of their journal classification systems.

In the case of WoS, Pudovkin and Garfield (2002) have offered a brief description of the way in which categories are constructed. According to Pudovkin and Garfield, when WoS was established, a heuristic and manual method was adopted to assign journals to categories, and after this, the so-called Hayne-Coulson algorithm was used to assign new journals. This algorithm is based on a combination of cited and citing data, but it has never been published.

Besides this, Katz and Hicks (1995), Leydesdorff (2007), and Leydesdorff and Rafols (2009) have indicated that the WoS classification system is based on a comprehensive consideration of citation patterns, titles of journals, and expert opinion.

In the case of Scopus, there seems to be no information at all on the construction of its classification system.

It should be mentioned that in the most recent versions of WoS two classification systems are available, namely a system of categories and a system of research areas.

The system of categories is more detailed. This system, which is the traditional classification system of WoS and the system on which we focus our attention in this paper, consists of around 250 categories and covers the sciences, social sciences, and arts and humanities.

The system of research areas, which has become available in WoS more recently, is less detailed and comprises around 150 areas.

Besides these two systems, Thomson Reuters also has a classification system for its Essential Science Indicators. This system consists of 22 subject areas in the sciences and social sciences. It does not cover the arts and humanities.

The Scopus journal classification system is called the All Science Journal Classification (ASJC). It consists of two levels. The bottom level has 304 categories, which is somewhat more than the about 250 categories in the WoS classification system. The top level includes 27 categories.

The accuracy of a classification system can seriously influence bibliometric studies. For instance, Leydesdorff and Bornmann (2015) investigated the use of the WoS categories for calculating field-normalized citation impact indicators. They focused specifically on two research areas, namely Library and Information Science and Science and Technology Studies. Their conclusion is that “normalizations using (the WoS) categories might seriously harm the quality of the evaluation”.

A similar conclusion was reached by Van Eck et al. (2013) in a study of the use of the WoS categories for calculating field-normalized citation impact indicators in medical research areas.

Glänzel and colleagues have studied several approaches to validate and improve WoS-based classification systems (Janssens, Zhang, De Moor, & Glänzel, 2009; Thijs, Zhang, & Glänzel, 2015; Zhang, Janssens, Liang, & Glänzel, 2010). They have also proposed an improved way of handling publications in multidisciplinary journals (Glänzel, Schubert, & Czerwon, 1999; Glänzel, Schubert, Schoepflin, & Czerwon, 1999).

Related to this, López-Illescas, Noyons, Visser, De Moya-Anegón, & Moed (2009) have studied an approach to improve the field delineation provided by categories in the WoS classification system.

The SCImago research group has made a number of attempts to improve the Scopus classification system (Gómez-Núñez, Vargas-Quesada, De Moya-Anegón, & Glänzel, 2011; Gómez-Núñez, Batagelj, Vargas-Quesada, De Moya-Anegón, & Chinchilla-Rodríguez, 2014; Gómez-Núñez, Vargas-Quesada, & De Moya-Anegón, 2016).

Two types of approaches can be distinguished for assessing the accuracy of journal classification systems. One is the expert-based approach and the other is the bibliometric approach.

Applying the expert-based approach at a large scale is challenging. No expert has sufficient knowledge to assess the classification of journals in all scientific disciplines, so a large number of experts would need to be involved.

In the case of the bibliometric approach, a further distinction can be made between text-based and citation-based approaches.

Text-based approaches could for instance assess whether the textual similarity of publications in journals assigned to the same category is higher than the textual similarity of publications in journals assigned to different categories.

Instead, we take a citation-based approach to assess the accuracy of journal classification systems.

In this paper, we use direct citation relations. This is because “a co-citation or bibliographic coupling relation requires two direct citation relations” (Waltman & Van Eck, 2012, p. 2380), which means that bibliographic coupling and co-citation relations are more indirect signals of the relatedness of journals than direct citation relations.

The use of direct citation relations is also supported by Klavans and Boyack (2015), who study the algorithmic construction of classification systems at the level of individual publications. They conclude that the use of direct citation relations yields more accurate results than the use of bibliographic coupling or co-citation relations.

Thus, the rationale of our approach can be summarized as follows: A journal should cite or be cited by journals within its own category with a high frequency in comparison with journals outside its category.

Based on this basic principle, we define two criteria to identify journals with questionable classifications. One criterion is that if a journal has only a very small number of citation relations with other journals within its own category, then we believe the classification of the journal to be questionable. The other criterion is that if a journal has many citation relations with journals in a category to which the journal itself does not belong, then it seems likely that the journal incorrectly has not been assigned to this category.
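As an illustration of how the two criteria could be operationalized, here is a minimal Python sketch. It assumes a journal-journal citation matrix C and a 0/1 journal-category assignment matrix A; the relatedness measure (the share of a journal's citation relations that involve a category) and the default thresholds are simplifications for illustration, not the exact normalization used by Wang and Waltman.

```python
import numpy as np

def relatedness(C, A):
    """R[i, k]: share of journal i's citation relations (citing plus cited)
    that involve journals assigned to category k.
    C: (n x n) matrix, C[i, j] = citations from journal i to journal j.
    A: (n x m) 0/1 journal-category assignment matrix."""
    S = C + C.T                            # symmetric citation relations
    np.fill_diagonal(S, 0)                 # ignore journal self-citations
    strength = S @ A                       # relations with each category
    total = S.sum(axis=1, keepdims=True)
    return strength / np.maximum(total, 1)

def criterion_I(R, A, alpha=0.1):
    """Flag journal-category assignments with relatedness below alpha."""
    return (R < alpha) & (A == 1)

def criterion_II(R, A, beta=0.5):
    """Flag categories a journal is strongly related to but not assigned to."""
    return (R > beta) & (A == 0)
```

Scanning the boolean matrix returned by criterion_I for categories in which at least half of the journals are flagged would then reproduce the kind of category-level analysis reported in Tables 4 and 5.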

We retrieved from the WoS and Scopus databases all journals that have publications between 2010 and 2014. ... The choice of a five-year time window is a trade-off between on the one hand the stability of journal classification systems and on the other hand the accuracy of our approach based on direct citation relations.

As can be seen in Table 1, the number of Scopus journals included in the analysis is almost twice as large as the number of WoS journals, and Scopus also includes 80 more categories than WoS. Furthermore, although both databases often assign journals to multiple categories, we found that Scopus tends to assign journals to more categories than WoS. WoS assigns journals to at most six categories, whereas in Scopus there turns out to be a journal that is assigned to 27 categories. Additionally, we found that the average number of categories to which journals belong equals 1.6 in WoS and 2.1 in Scopus. This shows that on average journals have significantly more category assignments in Scopus than in WoS.

As can be seen, almost 60% of all journals in WoS belong to only one category, whereas in Scopus more than 60% of all journals are assigned to two or more categories.


WoS has 1390 journals with ti < 100, accounting for 11% of the total number of WoS journals, whereas Scopus has 5808 journals with ti < 100, which is 24% of the total. Hence, Scopus has more journals with ti < 100 than WoS not only in an absolute sense but also from a relative point of view.

Taking a further look at Scopus journals with ti < 100, it turns out that they can be roughly divided into three groups. One group consists of arts and humanities journals, another group consists of newly included journals, and a third group consists of non-English language journals.

Table 2 provides some basic statistics on the assignment of journals to categories in WoS and Scopus when journals with ti < 100 and assignments of journals to multidisciplinary categories are excluded. The table shows the number of journals that belong to at least one non-multidisciplinary category and the number of assignments of journals to non-multidisciplinary categories. As can be seen in the table, in the case of Scopus the constraints that we have introduced cause a much larger decrease in the number of journals and the number of journal-category assignments than in the case of WoS.

Table 3 reports for both WoS and Scopus and for three values of the threshold α the number of journals and the number of journal-category assignments that satisfy Criterion I.

As can be seen, both databases have assigned a significant number of journals to categories that according to Criterion I seem to be inappropriate.

Moreover, no matter which threshold is considered, Scopus performs substantially worse than WoS, not only in the absolute number of journals and journal-category assignments satisfying Criterion I but, more importantly, also in the percentage of journals and journal-category assignments satisfying the criterion.

Next, we identify WoS and Scopus categories with a high percentage of journals satisfying Criterion I. The identified categories may be seen as the most problematic categories in the two databases, because many of the journals belonging to these categories are only weakly connected to each other in terms of citations.

We select categories that include at least 10 journals with ti ≥ 100 and that, for α = 0.1, have at least 50% of their journals satisfying Criterion I. The results for WoS and Scopus are reported in Tables 4 and 5, respectively. In the case of WoS 17 categories have been identified, whereas in the case of Scopus 76 categories have been identified, so more than four times as many as in the case of WoS.

There are three categories that have been identified in the case of both databases: ARCHITECTURE, BIOPHYSICS, and MEDICAL LABORATORY TECHNOLOGY.

Table 6 presents for both WoS and Scopus and for five values of the threshold β the number of journals that satisfy Criterion II.



A journal satisfies both Criterion I and Criterion II if on the one hand it has weak connections, in terms of citations, with its assigned categories while on the other hand it has a strong connection with a category to which it is not assigned. More precisely, our focus is on journals for which the current category assignments all satisfy Criterion I, while there is an alternative category assignment that satisfies Criterion II.

Based on the three journals discussed above, we conclude that journals satisfying the combined Criteria I and II can be classified into at least two types. One type refers to journals for which there is a discrepancy between on the one hand their title and their scope statement and on the other hand what they have actually published.  ... The second type refers to journals that seem to have been assigned to a category based only on their title.

First, WoS performs much better than Scopus according to Criterion I. Using the parameter values α = 0.05 and α = 0.1, the percentage of journals and journal-category assignments satisfying Criterion I is more than two times higher for Scopus than for WoS. Hence, in Scopus journals are assigned to categories with which they are only weakly connected much more frequently than in WoS.

Second, based on Criterion II, WoS and Scopus both perform reasonably well, with WoS having a somewhat better performance than Scopus. For all parameter values that were considered, less than 5% of all journals in WoS and Scopus satisfy Criterion II. In other words, if a journal is strongly connected to a category, WoS and Scopus typically assign the journal to that category.

Third, WoS also presents a significantly better result than Scopus based on the combined Criteria I and II. In WoS there is only one journal satisfying the combined criteria, whereas in Scopus there are 32.

First, Scopus sometimes has confusing category labels. In particular, Scopus sometimes has two categories with very similar labels. Examples are the categories LINGUISTICS & LANGUAGE and LANGUAGE & LINGUISTICS and the categories INFORMATION SYSTEMS & MANAGEMENT and MANAGEMENT INFORMATION SYSTEMS.

Second, lack of transparency is a weakness of both the WoS and the Scopus classification system. We did not find proper documentation of the methods used to construct and update the WoS and Scopus classification systems.

For instance, in the case of a small category, it may be hardly possible for a journal to have a reasonably high relatedness with the category. Therefore it can be expected that many journals belonging to the category will satisfy Criterion I. This may be caused not so much by the misclassification of these journals but more by the small size of the category. On the other hand, in the case of a large category, there may be other problems. A large category may for instance be of a heterogeneous nature and may cover multiple fields that are hardly connected to each other.

Sunday, July 10, 2016

Janssens, F., Zhang, L., De Moor, B., & Glänzel, W. (2009). Hybrid clustering for validation and improvement of subject-classification schemes. Information Processing & Management, 45(6), 683-702.




Cognitive mapping of science visualizes the structure of science. It was initially applied in information services and later found use in science policy and research evaluation; at present it is increasingly applied to identifying emerging and converging fields and to improving subject delineation.

Cognitive mapping of science follows three main approaches: citation-based methods, text-based methods, and hybrid methods combining the two. This study uses a hybrid text/citation method to cluster the journals covered by the Web of Science database in 2002-2006, and uses the resulting cognitive map to validate the current journal subject-classification scheme and, if possible, to propose improvements.

The study first evaluates and visualizes the 22-field subject-classification scheme of the ESI (Essential Science Indicators). The left and right panels of Fig. 1 show the Silhouette values of the 22 ESI fields measured with cross-citations and with text, respectively. They reveal that the journals in Biology & Biochemistry (#2), Clinical Medicine (#4), Engineering (#7), Plant & Animal Science (#19), and Social Sciences (#21) are not sufficiently coherent.



The TF-IDF weights of terms yield descriptive terms for each field, and these show considerable overlap between several fields, for example Engineering (#7) and Computer Science (#5), Chemistry (#3) and Materials Science (#11), Plant & Animal Science (#19) and Environment/Ecology (#8), as well as Biology & Biochemistry (#2), Molecular Biology & Genetics (#14), and Clinical Medicine (#4). In addition, the terms describing Social Sciences (#21) reveal the pronounced heterogeneity of that field.


Fig. 2 is a structural map of the 22 ESI fields drawn with Pajek. It confirms strong links between Biology & Biochemistry (#2) and Molecular Biology & Genetics (#14), Chemistry (#3) and Materials Science (#11), Computer Science (#5) and Engineering (#7), and Environment/Ecology (#8) and Plant & Animal Science (#19).

Next, about 8,300 journals were clustered using cosine similarities and Ward's agglomerative hierarchical clustering algorithm, and the clustering results were compared with the classification scheme. Notably, the textual component provides labels for the resulting clusters, while the citation component yields a cross-citation graph for visualization and serves as input to the PageRank algorithm for determining representative journals.

The number of clusters can be chosen according to clustering quality, which can be assessed with internal or external validation measures. Internal validation considers only the statistical properties of the data and clusters, for example the dendrogram, Silhouette values, and modularity; external validation compares the clustering result with a known reference partition, for example by computing the Jaccard similarity between the two. Based on a visual inspection of the dendrogram, the journals were divided first into three clusters, then into seven, and finally into 22. The three top-level clusters roughly correspond to the natural and applied sciences, the medical sciences, and the social sciences and humanities. Judging from the TF-IDF terms, three of the seven clusters belong to the natural and applied sciences (biology, agriculture, and environmental sciences; physics, chemistry, and engineering; mathematics and computer science), two to the life sciences (biosciences and biomedical research; clinical and experimental medicine and neuroscience), and two to the social sciences and humanities (economics, business, and political science; psychology, sociology, and education).

Fig. 6 is the structural map of the 22 clusters. It shows groups belonging to the social sciences and humanities (#1, #6, #14, #22 and #9, #11, #21), geosciences, environmental science, biology, and agriculture (#2, #15, #19), physics, chemistry, and engineering (#4, #20, #5), mathematics and computer science (#8, #18), biosciences and biomedical research (#3, #13, #16), and clinical and experimental medicine and neuroscience (#7, #10, #12, #17).



Table 3 compares the clustering quality of the 22 ESI fields with that of the 22 clusters produced by the citation-based, text-based, and hybrid methods; the hybrid combination of citation and textual data performs best on almost all indicators.


Fig. 8 uses the Jaccard index to measure the concordance between the clustering results and the ESI scheme.



Finally, the study analyzes journal 'migration', the phenomenon in which a journal is not assigned to the cluster that best matches its ESI field but ends up in a different cluster. 'Good migration' increases the coherence of the classification, that is, the Silhouette value or modularity increases. The study uses this phenomenon to propose improvements to the current journal subject-classification scheme on the basis of the concordance between clusters and fields.


A hybrid text/citation-based method is used to cluster journals covered by the Web of Science database in the period 2002–2006. The objective is to use this clustering to validate and, if possible, to improve existing journal-based subject-classification schemes.

In a first step, the 22-field subject-classification scheme of the Essential Science Indicators (ESI) is evaluated and visualised. In a second step, the hybrid clustering method is applied to classify the about 8300 journals meeting the selection criteria concerning continuity, size and impact.

The hybrid method proves superior to its two components when applied separately. The choice of 22 clusters also allows a direct field-to-cluster comparison, and we substantiate that the science areas resulting from cluster analysis form a more coherent structure than the "intellectual" reference scheme, the ESI subject scheme.

Moreover, the textual component of the hybrid method allows labelling the clusters using cognitive characteristics, while the citation component allows visualising the cross-citation graph and determining representative journals suggested by the PageRank algorithm.

Finally, the analysis of journal ‘migration’ allows the improvement of existing classification schemes on the basis of the concordance between fields and clusters.

The history of cognitive mapping of science is as long as the history of computerised scientometrics itself. While the first visualisations of the structure of science were considered part of information services, i.e., an extension of scientific review literature (Garfield, 1975, 1988), bibliometricians soon recognised the potential value of structural science studies for science policy and research evaluation as well. At present, the identification of emerging and converging fields and the improvement of subject delineation are in the foreground.

The main bibliometric techniques are characterised by three major approaches, particularly the analysis of citation links (cross-citations, bibliographic coupling, co-citations), the lexical approach (text mining), and their combination.

For instance, clustering based on co-citation and bibliographic coupling has to cope with several severe methodological problems. This has been reported, among others by Hicks (1987) in the context of cocitation analysis and by Janssens, Glänzel, and De Moor (2008) with regard to bibliographic coupling. One promising solution is to combine these techniques with other methods such as text mining (e.g., combined co-citation and word analysis: Braam, Moed, & Van Raan, 1991a; combination of coupling and co-word analysis: Small (1998); hybrid coupling-lexical approach: Janssens, Glänzel, & De Moor, 2007; Janssens et al., 2008).

Jarneving (2005) proposed a combination of bibliometric structure–analytical techniques with statistical methods to generate and visualise subject coherent and meaningful clusters. His conclusions drawn from the comparison with ‘intellectual’ classification were rather sceptical.

Despite several limitations, which will be discussed further in the course of the present study, cognitive maps proved useful tools in visualising the structure of science and can be used to adjust existing subject-classification schemes even on the large scale as we will demonstrate in the following.

The main objective of this study is to compare (hybrid) cluster techniques for cognitive mapping with traditional ‘intellectual’ subject-classifications schemes.

In a first study by authors related to the current work, the pilot study of Glenisson, Glänzel, & Persson (2005), further extended and confirmed by Glenisson, Glänzel, Janssens et al. (2005), full-text analysis and traditional bibliometric methods were serially combined to improve the efficiency of the individual methods. It was clear that clusters found through application of text mining provided additional information that could be used to extend and explain structures found by bibliometric methods, and vice versa. However, the integration was still limited to serial combination.

All textual content was indexed with the Jakarta Lucene platform (Hatcher & Gospodnetic, 2004) and encoded in the Vector Space Model using the TF-IDF weighting scheme reviewed by Baeza-Yates & Ribeiro-Neto (1999). Stop words were neglected during indexing and the Porter stemmer was applied to all remaining terms from titles, abstracts, and keyword fields. The resulting term-by-document matrix contained nine and a half million term dimensions (9,473,061), but by ignoring all tokens that occurred in one sole document, only 669,860 term dimensions were retained. Those ignored terms with document frequency equal to one are useless for clustering purposes.

The dimensionality was further reduced from 669,860 term dimensions to 200 factors by Latent Semantic Indexing (LSI) (Berry, Dumais, & O'Brien, 1995; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990), which is based on the Singular Value Decomposition (SVD).

Text-based similarities were calculated as the cosine of the angle between the vector representations of two papers (Salton & McGill, 1986).

For simplicity and efficiency, the method used to summarise the subject of a field or cluster is based on selecting the terms with the highest mean TF-IDF weights over all journal papers in the field or cluster, where the IDF factor is calculated on the complete term-by-paper matrix (more than six million papers).
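The following Python sketch reproduces the gist of this text-mining pipeline with scikit-learn rather than the Lucene/Porter toolchain the authors used; the toy documents and cluster labels are hypothetical placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = ["gene expression regulation in plant cells",
        "protein folding and gene expression dynamics",
        "query expansion in text retrieval systems"]   # hypothetical corpus
labels = np.array([0, 0, 1])                           # cluster/field per paper

# TF-IDF vector space; min_df=2 mirrors dropping tokens that occur
# in one sole document
vec = TfidfVectorizer(stop_words="english", min_df=2)
X = vec.fit_transform(docs)

# Latent Semantic Indexing via truncated SVD (the paper used 200 factors)
k = min(200, X.shape[1] - 1)
lsi = TruncatedSVD(n_components=k).fit_transform(X)
text_sims = cosine_similarity(lsi)                     # text-based similarities

# Label a field/cluster by the terms with the highest mean TF-IDF weight
terms = np.array(vec.get_feature_names_out())
mean_w = np.asarray(X[labels == 0].mean(axis=0)).ravel()
print(terms[mean_w.argsort()[::-1][:10]])
```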

For example, Treeratpituk and Callan (2006) automatically select and assign a few concise labels to hierarchical clusters by combining statistical features from the cluster, parent cluster, and a corpus of general English into a descriptive score.

Geraci, Maggini, Pellegrini, and Sebastiani (2008) label clusters by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure, and by looking within the titles of Web pages for the substring that best matches the selected top-scoring words.

The similarities Sij used for clustering were found by calculating the cosine of the angle between the pair of vectors containing all symmetric journal cross-citation values between the two respective journals (i and j) and all other journals (i.e., row or column of the matrix C):
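A reconstruction of the formula from the prose (the cosine between rows i and j of the symmetric cross-citation matrix C):

$$S_{ij} = \frac{\sum_{k} c_{ik}\, c_{jk}}{\sqrt{\sum_{k} c_{ik}^{2}}\,\sqrt{\sum_{k} c_{jk}^{2}}}$$

where $c_{ik}$ denotes the symmetric cross-citation count between journals $i$ and $k$.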

The journal cross-citation graph is also analysed to identify important high-impact journals. We use the PageRank algorithm (Brin & Page, 1998) to determine representative journals in each cluster. Besides, the graph can also be used to evaluate the quality of a clustering outcome.

In order to subdivide the journal set into clusters we used the agglomerative hierarchical cluster algorithm with Ward’s method (Jain & Dubes, 1988).
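A minimal sketch of this step with SciPy, assuming journals are represented as rows of a feature matrix (random placeholder data here):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.random((100, 20))      # placeholder journal feature vectors

Z = linkage(X, method="ward")                      # Ward's agglomerative clustering
labels = fcluster(Z, t=22, criterion="maxclust")   # cut the tree at 22 clusters
# scipy.cluster.hierarchy.dendrogram(Z) can be plotted to inspect the hierarchy
```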

In general, the number of clusters is determined by comparing the quality of different clustering solutions based on various numbers of clusters. Cluster quality can be assessed by internal or external validation measures. Internal validation solely considers the statistical properties of the data and clusters, whereas external validation compares the clustering result to a known gold standard partition.

This compound strategy encompasses observation of a dendrogram, text- and citation-based mean Silhouette curves, and modularity curves. Besides, the Jaccard similarity coefficient is used to compare the obtained results with an intellectual classification scheme.

Up to a multiplicative constant, modularity measures the number of intra-cluster citations minus the expected number in an equivalent network with the same clusters but with citations given at random. Intuitively, in a good clustering there are more citations within (and fewer citations between) clusters than could be expected from random citing.
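In the standard Newman-Girvan form, which matches this verbal description (the paper's exact weighted variant may differ in detail):

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$

where $A_{ij}$ is the number of citations between journals $i$ and $j$, $k_i = \sum_j A_{ij}$, $m$ is the total number of citations, and $\delta(c_i, c_j) = 1$ when the two journals are in the same cluster.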

In Fig. 8, we use the Jaccard index to compare each cluster with every field from the intellectual ESI classification, in order to detect the best-matching fields for each cluster.
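A straightforward sketch of this concordance computation, assuming one cluster label and one ESI field label per journal (hypothetical arrays):

```python
import numpy as np

def jaccard_concordance(clusters, fields):
    """J[a, b] = |cluster a AND field b| / |cluster a OR field b|,
    computed over the sets of journals."""
    cs, fs = np.unique(clusters), np.unique(fields)
    J = np.zeros((len(cs), len(fs)))
    for a, c in enumerate(cs):
        for b, f in enumerate(fs):
            inter = np.sum((clusters == c) & (fields == f))
            union = np.sum((clusters == c) | (fields == f))
            J[a, b] = inter / union if union else 0.0
    return J   # the row-wise maximum marks each cluster's best-matching field
```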

Nowadays two ISI systems are widely used, in particular, the ISI Subject Categories, which are available in the JCR and through journal assignment in the Web of Science as well, and the Essential Science Indicators (ESI).

While the first system assigns multiple categories to each journal and is too fine grained (254 categories) for comparison with cluster analysis, the ESI scheme is forming a partition (with practically unique journal assignment) and the 22 fields are large enough. ... This subject-classification scheme is in principle based on unique assignment; only about 0.6% of all journals were assigned to more than one field over a 5-year period.

Fig. 1 presents the evaluation of the 22 ESI fields based on the cross-citation- (left) and text-based (right) Silhouette values (see Section 3.3.3). Since the ESI fields form a partition, this approach allows us to evaluate their consistency as if the fields were results of a clustering procedure. Multi-, inter- and cross-disciplinarity of journals can certainly affect the results.


Several fields seem not to be coherent enough from both perspectives (i.e., the cross-citation and textual approach). Above all, the Silhouette values of field #2 (Biology and Biochemistry), #4 (Clinical Medicine), #7 (Engineering), #19 (Plant and Animal Science) and #21 (Social Sciences) substantiate that at least five of the 22 fields are not sufficiently coherent.

Simultaneously to the above validation, the textual approach also provides the best TF-IDF terms – out of a vocabulary of 669,860 terms – describing the individual fields. These terms are presented in Table 2. Although these terms already provide an acceptable characterisation of the topics covered by the 22 fields, considerable overlaps are apparent between pairs of fields, respectively: Engineering (#7) and Computer Science (#5), Chemistry (#3) and Materials Science (#11), Plant and Animal Science (#19) and Environment/Ecology (#8), as well as Biology and Biochemistry (#2), Molecular Biology and Genetics (#14) and Clinical Medicine (#4). In addition, the terms characterising the social sciences (#21) reflect a pronounced heterogeneity of the field.



The structural map of the 22 ESI fields based on cross-citation links is presented in Fig. 2. For the visualisation we used Pajek (Batagelj & Mrvar, 2003). The network map confirms the strong links we have found based on the best terms between fields #2 and #14, #3 and #11, #5 and #7, and #8 and #19, respectively.

In Table 3 we compare the quality of the partition of 22 ESI fields with the quality of the 22 clusters resulting from citation-based, text-based and hybrid clustering.


The cluster dendrogram shows the structure in a hierarchical order (see Fig. 4). We visually find a first clear cut-off point at three clusters, a second one around seven, and 22 clusters also seemed to be an acceptable/appropriate number.

The number of three clusters results in an almost trivial classification. Intuitively, these three high-level clusters should comprise natural and applied sciences, medical sciences, and social sciences and humanities.

The solution comprising seven clusters results in a non-trivial classification. The best TF-IDF terms (see Table 5) show that three of these clusters represent the natural/applied sciences, whereas two classes each stand for the life sciences and the social sciences and humanities. This situation is also reflected by the cluster dendrogram in Fig. 4. A closer look at the best TF-IDF terms reveals that the social-sciences cluster (#1 of the 3-cluster solution) is split into cluster #1 (economics, business and political science) and #6 (psychology, sociology, education), the life-science cluster (#3 in the 3-cluster scheme) is split into clusters #3 (biosciences and biomedical research) and #7 (clinical, experimental medicine and neurosciences) and, finally, the sciences cluster #2 of the 3-cluster scheme is distributed over three clusters in the 7-cluster solution, particularly, the cluster comprising biology, agriculture and environmental sciences (#2), physics, chemistry and engineering (#4) as well as mathematics and computer science (#5).

The social-sciences and humanities clusters form two groups that are each strongly interlinked; one consists of clusters #1, #6, #14 and #22 with focus on humanities, economics, business, political and library science, the other one comprises #9, #11 and #21 with sociology, education and psychology. This is in line with the hierarchical structure shown in Fig. 4. These two groups correspond to the two social-sciences clusters in the 7-cluster solution (cf. Section 4.4).

On the basis of the most important TF-IDF terms (see Table 6) we can assign clusters #2, #15 and #19 to geosciences, environmental science, biology and agriculture, which, in turn, form a larger group corresponding to the first of the three "megaclusters" in the 7-cluster solution.

These science clusters form two groups, #4, #20 and #5 form one group of chemistry, physics and engineering, while #8 and #18 form the third group comprising mathematics and computer science.

Here we have a biomedical and a clinical group. These two groups are in line with the hierarchical structure of the dendrogram in Fig. 4 but less clearly distinguished in the graphical network presentation (Fig. 6). Nonetheless, the terms provide an excellent description for at least some of the medical clusters: cluster #7 stands for the neuro- and behavioral sciences, #3 for bioscience, #10 for the clinical and social medicine, #13 microbiology and veterinary science, #12 non-internal medicine, #16 hematology and oncology and #17 cardiovascular and respiratory medicine. According to the dendrogram clusters 3, 13, 16 and clusters 7, 10, 12, 17 form one larger cluster each. On the basis of the best terms, we can characterise these groups as the bioscience–biomedical and the clinical and neuroscience group, respectively.

In this subsection we compare the structure resulting from the hybrid clustering with the ESI subject classification. This comparison is based on the centroids of the clusters and fields. The centroid of a cluster or field is defined as the linear combination of all documents in it and is thus a vector in the same vector space. For each cluster and for each field, the centroid was calculated and the MDS of pairwise distances between all centroids is shown in Fig. 7.
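A sketch of this centroid-and-MDS comparison; the document vectors and labels below are random placeholders, and the mean is used as the (normalized) linear combination:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
docs_vec = rng.random((300, 50))          # placeholder document vectors
labels = rng.integers(0, 22, size=300)    # cluster (or ESI field) per document

# Centroid of each cluster/field = mean of its document vectors
centroids = np.vstack([docs_vec[labels == c].mean(axis=0) for c in range(22)])

# 2-D map of the pairwise distances between centroids, as in Fig. 7
coords = MDS(n_components=2, random_state=0).fit_transform(centroids)
```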

In Fig. 8, we use the Jaccard index to determine the concordance between our clustering solution and the ESI Scheme by comparing each cluster with every field, in order to detect the best-matching fields for each cluster. The darker a cell in the matrix, the higher the Jaccard index, and hence the more pronounced the overlap between the corresponding cluster and ESI field.

If clustering algorithms are adjusted or changed, one can observe the following phenomenon. Some units of analysis are leaving clusters they formerly belonged to and end up in different clusters. This phenomenon is called ‘migration’. We can distinguish between ‘good migration’ and ‘bad migration’.

‘Good migration’ is observed if the goodness of the unit’s classification improves, otherwise we speak about ‘bad migration’. We can also apply this notion of migration to the comparison of clustering results with any reference classification. In the following we will use the ESI scheme as reference classification.

Out of 8305 journals under study, there were more than one third, namely, 3204 journals that were not assigned to the cluster which best matches their ESI field. As already mentioned above, we call these journals ‘migrated journals’.

‘Good migrations’ are observed if journals improved their Silhouette values after migration. Based on their titles and scopes (not shown), apparently they should indeed be assigned to the cluster to which they have moved.

Although the Silhouette and modularity values substantiate a more coherent structure of the hybrid clustering as compared with the ESI subject scheme, not all clusters are of high quality. Problems have been found, for instance, in clusters #1 and #12 where interdisciplinarity and strong links with other clusters distort the intra-cluster coherence.


Thursday, July 7, 2016

Thijs, B., Zhang, L., & Glänzel, W. (2015). Bibliographic coupling and hierarchical clustering for the validation and improvement of subject-classification schemes. Scientometrics, 105(3), 1453-1467.


Thijs, B., Zhang, L., & Glänzel, W. (2013, January). Bibliographic coupling and hierarchical clustering for the validation and improvement of subject-classification schemes. In Proceedings of ISSI (pp. 237-249).

This study clusters the journals covered by the Web of Science database using bibliographic coupling. Second-order similarities are used to alleviate the sparseness of the similarity matrix produced by bibliographic coupling, and the number of clusters produced by Ward's agglomeration method is decided from the dendrogram and silhouette statistics. Finally, the clustering results are compared with the journal-based subject-classification scheme of Glänzel & Schubert (2003) to examine the correspondence between the two and to label the clusters.

The advantage of bibliographic coupling is that all the data needed are already present in the publications or the database, so there is no delay in calculating the links between publications and journals, and once established, a link remains constant over time. However, as with other citation-based methods, related publications or journals do not share all of their references, so the similarity matrix inevitably contains a large number of zeros (Janssens, 2007; Janssens et al., 2008), which produces an unrealistically large number of singletons and deteriorates the quality of the subsequent clustering. One previous solution is to combine citation-based with lexical similarities, as in Janssens et al. (2008); another is to build the similarity matrix from second-order similarities, as in Janssens (2007), Ahlgren & Colliander (2009), and Thijs et al. (2013).

This study applies second-order similarities to cluster the journals in the Web of Science database, covering the 8,282 journals that published at least 100 papers between 2006 and 2009. The procedure is as follows:

1. A first-order similarity is computed with Salton's cosine measure, and the cosine measure is then applied again to the first-order similarity matrix to produce second-order similarities. After this computation, 10 journals had no links to any other journal and were removed, leaving 8,272 journals in the network.

2. Hierarchical clustering with Ward's agglomeration method is used, and likely numbers of clusters are inferred from the dendrogram and silhouette statistics; in this study, from top to bottom, these are 6, 14, and 24 clusters. The silhouette method computes for each journal a value between -1 and 1, where a positive value indicates that the journal is assigned to an appropriate cluster. The journals are then grouped by cluster and sorted by silhouette value; the resulting plot profiles the quality of each cluster, and a larger area on the positive side, in other words more journals with appropriate assignments, indicates a better partitioning.

To examine the clustering, the resulting clusters are compared with the journal-based subject-classification scheme of Glänzel & Schubert (2003), and representative core journals are identified in each cluster to analyze it. Figure 5 presents the relations between clusters as a network. On the map, Arts and Humanities lies far from the other clusters, Neurosciences & Behaviour sits between the social sciences and the life sciences, and Chemistry takes a central position among the Biosciences, the medical sciences, and Physics. A striking observation is the strong link between General, Regional and Community Issues and the life sciences.



In addition, the Jaccard index was computed between the 14 clusters and the 15 disciplines of Glänzel & Schubert (2003) (excluding the journals under Multidisciplinary); the results are presented in Table 3.

Besides the comparison with the journal scheme of Glänzel & Schubert (2003), the clustering results were also compared with the ESI (Essential Science Indicators) categories; however, the ESI segmentation turned out not to be supported by the structure found in this study.

An attempt is made to apply bibliographic coupling to journal clustering of the complete Web of Science database. Since the sparseness of the underlying similarity matrix proved inappropriate for this exercise, second-order similarities have been used.

Cluster labelling was made on the basis of the about 70 subfields of the Leuven-Budapest subject-classification scheme that also allowed the comparison with the existing two-level journal classification system developed in Leuven. The further comparison with the 22 field classification system of the Essential Science Indicators does, however, reveal larger deviations.

The issue of subject classification and the creation of coherent journal sets has been a major topic in our field since the seventies (see e.g., Narin et al., 1972; Narin, 1976).

The development of computerised methods and the availability of large datasets have shifted the attention from mapping small or single disciplines to the generation of global science maps (Garfield, 1998).

Jarneving (2005) applied bibliographic coupling to map and to analyse the structure of an annual volume of the Science Citation Index.

Janssens et al. (2008; 2009) used a combination of cross-citations and a lexical approach to map journals. Zhang et al. (2010) validated this approach.

The advantage of bibliographic coupling is that there is no delay in the calculation of the link between publications or journals, as all data needed are present upon publication or indexing in the database. This also means that a link between documents, once established, will remain constant over time.

The main disadvantage results from the very sparse nature of the link matrix (Janssens, 2007; Janssens et al., 2008). The overwhelming number of document pairs do not share any reference at all, and thus a large number of zeros occur in the similarity matrix. This deteriorates the quality of the subsequent clustering and may result in an unrealistically large number of singletons (cf. Jarneving, 2005).

As cross-citation data suffers from the same problem, Janssens et al. (2008) introduced a hybrid approach, where they combined citation-based with lexical similarities.

Another solution to overcome the sparseness problem is the use of second order similarities (Janssens, 2007; Ahlgren & Colliander, 2009; Thijs et al., 2013).

A set of journals was compiled from the Web of Science database (SCI-Expanded, SSCI and AHCI). All journals covered in this database between 2006 and 2009 with at least 100 publications in this period are taken into account. This resulted in a set of 8282 journals.

To express the strength of a link between two journals we calculated a first order similarity based on Salton’s cosine measure. The mathematical derivation and interpretation of this similarity measure in the framework of a Boolean vector space model can be found in (Sen & Gan, 1983; Glänzel & Czerwon, 1996).

As bibliographic coupling tends to produce very sparse similarity matrices we applied a second order similarity to reduce this effect. While the first-order similarity is based on the angle between two reference vectors, the second-order similarity is calculated as the cosine of the angle of two vectors holding the first order similarity between two journals.
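A compact sketch of the first- and second-order similarities described here, assuming a journal-by-reference incidence matrix (toy random data):

```python
import numpy as np

def cosine_rows(M):
    """Pairwise cosine similarity between the rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    N = M / np.maximum(norms, 1e-12)
    return N @ N.T

rng = np.random.default_rng(0)
R = (rng.random((50, 400)) < 0.02).astype(float)  # journal x cited-reference matrix

S1 = cosine_rows(R)   # first-order: bibliographic-coupling cosine (very sparse)
S2 = cosine_rows(S1)  # second-order: cosine between first-order similarity profiles
```

Because every journal has a nonzero first-order similarity with at least a few others, the second-order matrix S2 is far denser than S1, which is exactly the effect the paper exploits.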

After the calculation of the second-order similarities, ten journals were removed from the set as they appeared to be singletons without any link to the other journals in the set. The network thus included 8272 journals in total.

Hierarchical clustering with Ward’s agglomeration method was used to create a hard clustering of all the journals.

This method does not provide any automated optimum number of clusters so that the decision was made on the basis of the dendrogram and the silhouette statistics (Rousseeuw, 1987).

Three different levels were chosen. The dendrogram holds strong arguments for a six cluster partitioning while the silhouette plot shows a first peak at 7 clusters. For the highest hierarchical level in the following analysis we use the six cluster solution. At a lower level, the silhouette plot suggests the solutions with 14 and 24 clusters, respectively.

For the evaluation of the specific cluster solution we can rely on the silhouette graphs presented in Figure 4. Each graph presents the silhouette values of the journals in the respective cluster. For each journal a silhouette value is calculated. These values range between 1 and -1 where positive values indicate an appropriate clustering of the journals. Journals are grouped by cluster and ordered from highest silhouette value to lowest. As a consequence the graph gives a good profile of the quality of each cluster. A larger area at the positive side of the vertical axis thus represents a better partitioning.
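A sketch of such a silhouette profile with scikit-learn, reusing the second-order similarities S2 from the previous sketch and hypothetical cluster labels:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

D = 1 - S2                 # turn similarities into distances (S2 from above)
np.fill_diagonal(D, 0)
labels = np.random.default_rng(0).integers(0, 14, size=D.shape[0])

sil = silhouette_samples(D, labels, metric="precomputed")
for c in np.unique(labels):
    profile = np.sort(sil[labels == c])[::-1]  # journals ordered by silhouette value
    print(c, profile.round(2))                 # plotting these gives the Fig. 4 profiles
```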

In order to find an acceptable solution, we decided to use the journal-based subject-classification scheme developed in Leuven (Glänzel & Schubert, 2003). This solution proved most advantageous since both clustering and classification scheme are based on journal assignment. Table 1 presents the hierarchical structure of the three-level partitioning. For each cluster the number of journals is mentioned. The labels for the higher levels can be deduced from the lowest level. These labels are taken from the Leuven classification system. The label from the most prominent subject category has been assigned to the corresponding cluster.

Another way to describe the clusters is by using core journals. This notion can be defined analogously to the core documents introduced by Glänzel & Czerwon (1996) and extended by Glänzel & Thijs (2011).

In this particular application, a core journal can be identified as a journal with at least n links to other journals of at least a given strength r on the second-order similarity measure. For the identification of core journals in each cluster we set the number of strong links to at least half the set of journals in the cluster.

As we are using second-order similarities, this choice is not unreasonable. The value of the strength is chosen such that 12 journals within each cluster comply with both criteria. This means that the appropriate r-value is higher for denser clusters than for clusters whose journals are not as strongly linked.
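A sketch of this core-journal selection, assuming the second-order similarity matrix S2 from above and a list of one cluster's member indices; the threshold r would be tuned per cluster so that 12 journals qualify:

```python
import numpy as np

def core_journals(S2, members, r):
    """Members with similarity >= r to at least half of the other members."""
    sub = S2[np.ix_(members, members)].copy()
    np.fill_diagonal(sub, 0)            # do not count a journal's self-link
    strong = (sub >= r).sum(axis=1)     # strong links per member journal
    need = len(members) // 2
    return [j for j, s in zip(members, strong) if s >= need]
```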

Above all, chemistry is at each level a separate cluster. One might expect that at the highest level, chemistry is merged with Physics but we found different patterns.

The second noteworthy observation concerns cluster 17 (Public Health & Nursing). This is a cluster within the ‘Psychology – Neuroscience’ cluster at the highest, six-cluster level. In other partitions or subject classification systems this is attributed to Non-Internal Medicine.

To visualise relations between the 24 clusters we created an additional map. Figure 5 shows these relations.



Despite these multiple assignments we used the Jaccard index to measure the concordance between the two journal classification schemes. The results are presented in Table 3.

Arts and Humanities is an outlier, Neurosciences & Behaviour acts as a bridge between Social Sciences and Life Sciences, and Chemistry takes a central position between Biosciences, Medical Sciences and Physics. The most striking observation in the map is the position of General, Regional and Community Issues, which is strongly linked with the Life Science fields.

A 24 cluster solution can be compared with the 22 categories from the classification of Thomson Reuters’ Essential Science Indicators (ESI).

Janssens et al. (2009) showed very low mean silhouette values for the ESI category system in a space with respectively textual distances, cosine similarities of cross-citation vectors and combined distances.

Also in the present study, not all clusters have a unique counterpart in the ESI classification system and vice versa (cf. Janssens et al., 2009). Notably, the ESI fields Clinical Medicine, Engineering, Mathematics, and Social Sciences, General are almost uniformly spread over numerous clusters.

Based on this analysis we have to conclude that the segmentation of journals in the ESI categories is not supported by the structure found with bibliographic coupling between journals.

Given this rather weak association between the clustering based on cross-citations and bibliographic coupling, it is a legitimate question to ask which of both methodologies is performing best. A comparison of the mean silhouette values and the silhouette values within each cluster reveals that the methodology presented in this paper results in a more consistent solution.

The 15-cluster cross-citation solution has a mean silhouette value of 0.04, while bibliographic coupling results in a value of 0.13.

The main advantage of this method is that clustering can be made as soon as a new database volume is available. The only issue is the lacking cluster labelling that cannot directly be obtained from the method. As a substitute, intellectual classification schemes can be used as reference system. Cluster labelling was made on the basis of the Leuven-Budapest subject-classification scheme that also allowed the comparison with the existing two-level journal classification system developed in Leuven.

The further comparison with the 22-field classification system of the Essential Science Indicators did, however, reveal some striking deviations. These concerned, above all, the fields of clinical medicine, engineering, mathematics and the social sciences. New developments in computer science, neuroscience and psychology as well as in public health (cf. Glänzel & Thijs, 2011) certainly contribute to such growing deviation.

The main objective of this study was to analyse whether the proposed methodology is appropriate for multi-level journal clustering and to what extent the solutions fit in the framework of traditional subject classification. Further comparison with other solutions such as cross-citation and hybrid methods will be part of future research.

Jeong, Y. K. & Song, M. (2016). Applying content-based similarity measure to author co-citation analysis. In Proceedings of iConference 2016.


This study measures the topical relatedness of authors using the similarity of the sentence content in which citations appear. Traditional author co-citation analysis (ACA) uses the co-citation frequency of cited authors in the reference section (White and Griffith, 1981) and then measures author similarity with the Pearson correlation coefficient or Salton's cosine similarity; it has been widely used in bibliometric research to identify and trace the intellectual structure of an academic discipline (He & Hui, 2002). This approach, however, does not consider citation content. Jeong, Song, & Ding (2014) and Zhao & Strotmann (2014) instead analyzed the authors mentioned in the full text and incorporated the related content into ACA.

This study argues that the accumulated citing sentences in which a cited author appears can represent that author's research areas. It therefore uses full-text data from JASIST, parsing the HTML to extract article metadata (title, author names, publication year, DOI, and abstract), citation information (citing sentences and reference indices), and reference information (author names, year, title, and journal). The study uses 1,910 articles from January 2003 to June 2015, with a total of 77,408 references. Citing sentences were separated from ordinary sentences, reference indices within sentences were linked to the reference information, and the 100 most cited authors were selected for both traditional ACA and the proposed new method. The new method uses the Word2Vec models of Mikolov et al. (2013) to derive author similarity from the citing sentences in which references appear. Trained on a large amount of text with neural network approaches, Word2Vec captures semantic relations between words by turning each word in a sentence into a vector, such that the similarities between vectors preserve the semantic relations between the words. The study treats cited author names as words appearing in citing sentences and measures both the topical similarity and the collaboration relations between authors.

Table 2 lists the 10 most similar author pairs found by traditional ACA and by the proposed method; about half of the 10 pairs found by the proposed method are co-author relationships.

In addition, the author relations produced by the two methods were drawn as network maps, in which nodes represent authors, node size is determined by PageRank, the distance between nodes is determined by author similarity, and clusters are detected with the community detection method of Blondel, Guillaume, Lambiotte, & Lefebvre (2008). Figures 3 and 4 show the results of traditional ACA and of the proposed method, respectively.

Figure 3 clearly divides the authors into two groups. According to the community detection results, the authors on the left can be split further: the red cluster at the far left consists of authors studying information seeking behavior, the purple cluster relates to information retrieval research, and the green cluster on the right consists of authors studying bibliometrics. Two authors, Borgman and Salton, sit between the two large groups; both are authors traditionally cited frequently in the information science field.



In the author network produced by the Word2Vec method, the author groups related to information retrieval are on the left, comprising an information seeking behavior group at the top and a document retrieval group at the bottom. Bibliometrics splits into two related groups in Figure 4, one mainly covering author analysis and the other journal citation analysis and evaluation indicators. Unlike Figure 3, the communities in Figure 4 are all connected to each other, and the map shows the sub-disciplines and the important authors more concretely.

Unlike other ACA studies, we used citing sentences to reflect topical relatedness of authors.

In our research, we extended traditional approaches by adopting Word2Vec, one of the deep learning methods, to measure author similarity.

We also conducted in-depth network analysis of author maps.

The Word2Vec-based author map revealed more specific sub-disciplines and important authors in terms of topical influence than the traditional approach does.

Author co-citation Analysis (ACA), which was introduced by White and Griffith (1981), has been widely used in bibliometric research to identify and trace the intellectual structure of an academic discipline (He & Hui, 2002). In ACA, traditional approaches relied on the co-citation frequency of cited authors in the reference section.

Thus, one of the main topics in ACA was methodological discussion of what kind of measure is appropriate and relevant for calculation of author similarities (Leydesdorff, 2005; van Eck & Waltman, 2007). Existing approaches based on co-citation frequencies such as Pearson correlation coefficient and Salton’s cosine similarity, however, do not capture the citation content.

Thus, some recent studies used the full text to obtain the topical relatedness between cited authors (Jeong, Song, & Ding, 2014; Zhao & Strotmann, 2014). They analyzed the authors mentioned in the full text and incorporated content related to cited authors into ACA.

In that sense, the cumulated citing sentences of cited authors can well represent the cited research and the cited authors' research areas. In addition, these citing sentences are particularly useful for summarizing a research document.

Figure 1 shows the overall system flow of our approach.


For content analysis, we collected full-text research articles of JASIST in HTML format. Through the HTML parsing process, we extracted the metadata (title, author name, year, DOI, and abstract), citation information (citing sentences and reference ids), and reference information (author name, year, title, and journal).

To compare our method with traditional ACA, we computed author pairs in both approaches. In the Word2Vec-based method, the full-text data are first split into sentences. In the second step, by matching the citing sentences with the reference ids in the reference section, we separated the citing sentences from the other general sentences. The citing sentences are then preprocessed in the following steps: tokenization, POS tagging, lemmatization of the tokenized sentences, and stop word removal.

From these data, we trained a Word2Vec model to calculate author similarity and generated an author-author similarity matrix. For comparison with the previous research, the traditional author counting approach, we also constructed a co-citation matrix based on citation counts. Since we preprocessed the full text including all reference information, these matrices consider all cited authors.

For evaluation, we selected the top 100 most highly cited authors in both methods and conducted network analysis by visualizing author maps.

The data was gathered from 1,910 full-text articles in the JASIST digital library over 12 years (from January 2003 to June 2015). The 1,910 collected documents have 77,408 references. We extracted elements from the full-text article: 1) citing sentences from the body of the article, 2) the references information, and 3) all cited authors. Table 1 shows the basic statistics of collected data.

Word2Vec models, one of the neural network approaches, are able to carry semantic meanings and turn text into a numerical form that deep-learning nets can understand (Mikolov et al., 2013). Based on a large amount of plain text, Word2Vec learns relationships between words automatically.

Word2Vec spatially encodes word meanings and the relationships between words, and was originally applied to word clustering and synonym detection (Wolf et al., 2014). We applied Word2Vec to author similarity measurement, regarding cited author names as words in plain text.
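A minimal sketch of this idea with gensim (version 4 API); the citing sentences and author-name tokens below are toy, hypothetical data, and min_count is lowered only to make the toy corpus trainable:

```python
from gensim.models import Word2Vec

# Tokenized citing sentences in which cited author names are kept as tokens
sentences = [
    ["salton", "propose", "vector", "space", "model", "retrieval"],
    ["white", "griffith", "introduce", "author", "cocitation", "analysis"],
    ["white", "map", "intellectual", "structure", "information", "science"],
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, seed=1)

# Author similarity = cosine similarity of the learned name vectors
print(model.wv.similarity("white", "griffith"))
```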

Since an author's oeuvre is represented by the citing sentences in research articles, the Word2Vec-based method can consider the various topics of the author.

In the proposed approach, however, the author names are also trained as words in the same citing sentence. Therefore, the similarity between two authors in the Word2Vec-based method reflects both topical relatedness and collaboration.

Table 2 shows the top 10 pairs by the traditional ACA method (Pearson correlation-based similarity) and the Word2Vec-based approach, respectively. About half of the pairs resulting from the Word2Vec approach are co-author relationships.

These results imply that the proposed approach can detect a wider range of author pairs in terms of topical relatedness and grasp more diverse research fields of information science.

To examine whether there are structural differences in two measures of author similarity, we constructed two author networks with top 100 authors. For network visualization, we used PageRank (Brin & Page, 1998) to determine the node size and also adopted the modularity algorithm (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008) for the community detection.
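A sketch of this visualization preprocessing with networkx (which provides a Louvain implementation from version 2.8 onward); the author similarity network below is toy data:

```python
import networkx as nx

# Edge weights stand in for author similarities from either method
G = nx.Graph()
G.add_weighted_edges_from([
    ("white", "griffith", 0.9), ("salton", "brin", 0.4),
    ("brin", "page", 0.95), ("salton", "white", 0.2),
])

node_size = nx.pagerank(G, weight="weight")   # determines node size in the map
groups = nx.community.louvain_communities(G, weight="weight", seed=1)
print(node_size, groups)
```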

Figure 3 illustrates roughly two parts that consist of information retrieval and bibliometrics, two major research areas in JASIST. The author group of information retrieval (purple) along with information seeking behavior (red) is located at the left side, and the author group related with bibliometrics is located at the right side.

There are only two authors located between the two groups (Borgman and Salton), who are traditionally cited authors in the information science field. Borgman studied various topics including information retrieval and scholarly communication and wrote important books that won the ASIST best information science book award. Salton's works have also received many citations over a long period in the field of information science.


The author group related with information retrieval in the left side of the network is split into information seeking behavior (blue) located in the upper side of the network and document retrieval (yellow) located at the bottom side of the network. The group related to bibliometrics is also separated into two parts: (1) a cluster (green) including author analysis and (2) journal citation analysis and evaluation indicator (red).

Unlike Figure 3, the communities in the network are connected to each other. Brin is connected with both document retrieval and citation analysis communities. This may be attributed to the fact that the PageRank, developed by Brin and Page (1998), is used in information retrieval and also studied in network analysis to compute node centrality.

In bibliometrics, PageRank is adopted as one of the centralities in citation networks (Ding, Yan, Frazho, & Caverlee, 2009). Ingwersen, who is located between information retrieval and bibliometrics, studied information retrieval in his earlier works and extended his research area to network analysis such as webometrics.

It implies that the authors linked by citation are topically grouped in the Word2Vec-based author network.