Tuesday, April 21, 2015

Tseng, Y.-H. and Tsay, M.-Y. (2013) Journal clustering of library and information science for subfield delineation using the bibliometric analysis toolkit: CATAR. Scientometrics, 95, 503-528. doi: 10.1007/s11192-013-0964-1.

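Over recent decades, many scientometric analysis techniques have been developed, including the similarity measures needed to cluster bibliographic data, such as co-citation, bibliographic coupling, and co-word analysis; Yan and Ding (2012) compare these techniques. Many freely downloadable software tools implement and package these techniques for scientometric applications. Well-known examples are CiteSpace (Chen 2006, Chen et al. 2010), Sci2 Tool (Sci2 Team 2009), VOSviewer (Van Eck and Waltman 2010), BibExcel (Persson 2009), and Sitkis (Schildt and Mattsson 2006); Cobo et al. (2011) review these tools. This study has two parts. It first presents CATAR, a scientometric toolkit that bundles a series of clustering and mapping techniques based on bibliometric information. It then applies the toolkit to library and information science (LIS), using the journal clustering results to identify and analyze subfields and to suggest a set of LIS journals suitable for research evaluation.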

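Åström (2002) concluded from a study visualizing the concepts of a field that the choice of journals does affect how a research field is perceived and defined; in other words, the delineation of a field is closely tied to journal selection. Many studies have delineated the subfields of LIS, and most of them start from IS&LS (Information Science and Library Science), the subject category in ISI's JCR that is most relevant to LIS. The IS&LS category does not contain only LIS journals: it covers two closely related fields, information science and library science, and its scope differs slightly from that of LIS. According to Leydesdorff (2008), JCR subject categories are assigned using criteria such as a journal's title and its citation patterns, but the resulting classification does not match well the classification obtained from the principal components of the network generated from the database's own citation data. For this reason, most subfield delineation studies analyze a manually selected set of journals rather than all journals under the IS&LS category.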

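Techniques commonly used for subfield delineation include co-citation analysis to compare pairs of items, agglomerative hierarchical clustering (AHC) to group items into a dendrogram, and multi-dimensional scaling (MDS) to produce a two- or three-dimensional map. Several important studies follow.

Åström (2002) selected 1,135 articles published from 1998 to 2000 in major LIS journals and used BibExcel for author co-citation and keyword co-occurrence analysis, producing MDS maps. The co-citation of 52 highly cited authors yielded three clusters: "hard" information retrieval, "soft" information retrieval, and bibliometrics. The 47 most frequent keywords fell into library science (LS), information retrieval (IR), and bibliometrics. Åström (2002) suggested that library science was absent from the author co-citation analysis because of its publication channels: when cited sources such as books or regional journals are not covered by JCR, library science authors cannot appear in citation-based rankings.

Åström (2007) selected 21 LIS-related journals from the 55 journals in the JCR 2003 subject category and ran a document co-citation analysis on their 13,605 articles. Across three periods from 1990 to 2004, LIS split into two stable subfields, informetrics and information seeking and retrieval; with the spread of the World Wide Web, webometrics became a major research topic in both.

Janssens et al. (2006) applied a series of full-text analysis techniques together with MDS and AHC to 938 articles from five LIS-related journals published from 2002 to 2004. The articles fell into six clusters: two related to bibliometrics, one on IR, one containing general issues, and two smaller but increasingly important clusters on webometrics and patent analysis.

Moya-Anegón et al. (2006) selected 17 of 24 influential journals, excluding those that apply information science (IS) to a specific technology or knowledge domain (for example medicine, geography, or telecommunications). From the references cited by these 17 journals, they ran co-citation analyses on the 77 most-cited authors and the 73 most-cited journals, mapping the results with MDS, AHC, and self-organizing maps. The author co-citation analysis produced six subfields: scientometrics, citation analysis, bibliometrics, "soft" (cognitively oriented) information retrieval, "hard" (algorithmically oriented) information retrieval, and communication theory. The journal co-citation analysis produced four clusters: IS, LS, science studies, and management. Science studies in the journal analysis roughly corresponds to scientometrics, citation analysis, and bibliometrics in the author analysis, and IS corresponds to "soft" and "hard" information retrieval. For the same reason given by Åström (2002), LS did not appear in the author co-citation results.

Waltman et al. (2011) used JASIST as a seed and selected the journals most often co-cited with it, 48 journals in total including JASIST. They ran a bibliographic coupling analysis of these journals and visualized the results with VOSviewer, finding three subfields: LS, IS, and scientometrics.

Milojevic et al. (2011) used co-word analysis to study 10,344 articles published from 1998 to 2007 in 16 journals chosen on the basis of Nisonger and Davis (2005). They analyzed the co-occurrence of the 100 most frequent words in article titles and grouped them with AHC; the three main clusters were LS, IS, and bibliometrics/scientometrics.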

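The keyword co-occurrence analysis in Åström (2002) produced an LS subfield, but the author co-citation map did not. The journal and author co-citation analyses in Moya-Anegón et al. (2006) also differ somewhat: the journal analysis shows LS and management, which the author analysis lacks, whereas the author analysis shows communication theory, which the journal analysis lacks. This is generally attributed to citing behavior: most LS authors did not receive enough citations to reach the analysis threshold, so they do not appear in the author co-citation results of either study.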

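Ni et al. (2012) took the 61 journals in the JCR IS&LS category, excluded 3 non-English journals, and applied four analyses to the remaining 58: venue-author coupling, journal co-citation analysis, co-word analysis, and journal interlocking. The results were then analyzed with MDS and AHC. The subfields consistent across the four approaches were management information systems (MIS), IS, LS, and specialized clusters, and in all four MDS maps MIS was separated from the other clusters. Ni and Ding (2010) and Ni and Sugimoto (2011) suggested that the LIS-related journals in JCR be suitably reorganized.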

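This study (Tseng and Tsay 2013) covers all journals under the Information Science & Library Science (IS&LS) subject category of the Journal Citation Reports in Web of Science for 2000 to 2004 and 2005 to 2009: 50 journals in the earlier period (2000-2004) and 66 in the later period (2005-2009). The analysis follows the general workflow summarized by Börner et al. (2003): 1) data collection; 2) text segmentation; 3) similarity computation; 4) multi-stage clustering; 5) cluster labeling; 6) visualization; 7) facet analysis. The techniques needed for these steps are integrated into the software tool CATAR (Content Analysis Toolkit for Academic Research, http://web.ntnu.edu.tw/~samtseng/CATAR/). To compute relatedness between documents, the study treats each journal as one document and the journals cited by all of its papers as the document's features, then computes journal similarity with the Dice coefficient (Salton 1989). For two journals X and Y with sets of cited journals R(X) and R(Y), the similarity is $\mathrm{Sim}(X, Y) = 2 \cdot |R(X) \cap R(Y)| / (|R(X)| + |R(Y)|)$. In other words, journal similarity is computed by bibliographic coupling. Journals are clustered with complete-linkage hierarchical clustering. Each document starts as its own cluster; the most similar pair of clusters is merged into a larger cluster, and the step is repeated. The similarity of two clusters is defined as the smallest similarity between any pair of documents drawn from the two clusters; two clusters are merged only if this similarity exceeds a preset threshold, and merging continues until no further merge is possible. The study also adopts the Silhouette index (Ahlgren and Jarneving 2008; Rousseeuw 1987; Janssens et al. 2006).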
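The similarity and clustering steps described above can be sketched in a few lines of Python. This is only an illustration, not CATAR's implementation: the function names, the toy journals `J1` to `J4`, and the threshold of 0.5 are all hypothetical.

```python
from itertools import combinations


def dice(rx: set, ry: set) -> float:
    """Dice coefficient between two sets of cited journals (bibliographic coupling)."""
    if not rx and not ry:
        return 0.0
    return 2 * len(rx & ry) / (len(rx) + len(ry))


def complete_linkage(refs: dict, threshold: float) -> list:
    """Merge clusters while the best complete-linkage similarity exceeds `threshold`."""
    clusters = [frozenset([name]) for name in refs]

    def cluster_sim(a, b):
        # Complete linkage: the smallest pairwise similarity between members.
        return min(dice(refs[x], refs[y]) for x in a for y in b)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (a, b), best = max(
            (((a, b), cluster_sim(a, b)) for a, b in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if best <= threshold:
            break  # no remaining pair is similar enough to merge
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


# Hypothetical toy data: journal -> set of journals it cites.
refs = {
    "J1": {"A", "B", "C"},
    "J2": {"A", "B", "D"},
    "J3": {"X", "Y"},
    "J4": {"X", "Y", "Z"},
}
clusters = complete_linkage(refs, threshold=0.5)
print(sorted(sorted(c) for c in clusters))
```

With this toy input, J3 and J4 merge first (Dice 0.8), then J1 and J2 (Dice 2/3); the two resulting clusters share no cited journals, so merging stops.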

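The data comprise the journals under the JCR IS&LS category in two periods, 2000-2004 and 2005-2009: 50 journals with 9,546 paper records in the first period and 66 journals with 11,471 paper records in the second. The dendrograms and MDS maps show that in both periods the IS&LS journals form clusters for IR, MIS, scientometrics, academic libraries, medical libraries, and collection development, while two smaller clusters, open access and regional libraries, appear only in the later period. The journals in the MIS cluster are separated from the other IS&LS journals in their intellectual base, which indicates that they have a distinctive citation pattern. This study analyzes journals by bibliographic coupling, and its finding that MIS separates from the rest in intellectual base agrees with Ni et al. (2012), who used journal co-citation analysis, journal interlocking, terminology usage, and co-authorship. It also supports the view, held in many studies of the cognitive structure of LIS, that MIS journals should not be placed with the other journals in the single ISI category IS&LS and should be excluded from such analyses (Larivière et al. 2012). Finally, a diversity index is used to characterize the clusters, revealing that some subfields have a regional character.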

Wednesday, April 15, 2015

Moya-Anegón, F. de, Vargas-Quesada, B., Chinchilla-Rodríguez, Z., Corera-Álvarez, E., Munoz-Fernández, F.J., & Herrero-Solana, V. (2007). Visualizing the marrow of science. Journal of the American Society for Information Science and Technology, 58(14), 2167–2179.

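Representing the relationships among fields as a graph is widely believed to convey a great deal of information, because the relationships themselves can be examined, and to help both newcomers and experts understand and analyze a domain; demand for such methods and tools has therefore been growing. Most earlier studies used journals as the unit of analysis to produce science maps covering all fields of research. For example, Leydesdorff (2004a, 2004b) classified the science covered by JCR 2001 with the graph-analytical algorithm of biconnected components. Boyack, Klavans, and Börner (2005) applied eight different journal similarity measures to 7,121 SCI and SSCI journals and produced a science map with VxOrd. Samoylenko, Chao, Liu, and Chen (2006) constructed minimum spanning trees of scientific journals using SCI data from 1994 to 2001. This study proposes a method for drawing ISI (Institute of Scientific Information) categories as a science map. Links between categories are built from the co-citation of categories, PathfinderNetwork (PFNET) prunes the less important links, the Kamada-Kawai method determines the layout of the nodes, and factor analysis then corroborates the structure. Both this study and the authors' earlier work build science maps from the co-citation of categories. Categories are explicit enough as units of representation, and compared with smaller units they are more informative and user friendly for nonexpert users. Moya-Anegón et al. (2004) visualized the scientific domain of Spain, and Moya-Anegón et al. (2005) went on to use science maps to compare the scientific domains of England, France, and Spain. This study follows the knowledge domain mapping workflow proposed by Börner, Chen, and Boyack (2003). The data cover 7,585 ISI journals in 219 ISI categories; after Multidisciplinary Sciences is excluded, 218 categories are used. Category similarity is computed from co-citation as

$$CM(ij) = Cc(ij) + \frac{Cc(ij)}{\sqrt{c(i) \cdot c(j)}}$$

where Cc(ij) is the number of times categories i and j are co-cited, and c(i) and c(j) are the numbers of citations received by category i and category j. The network is then drawn with PFNET and the Kamada-Kawai method; after PFNET pruning, nodes with more links occupy more important positions. PFNET is a topology-oriented method and complements the clustering-oriented factor analysis: factor analysis identifies, delimits, and names the thematic areas shown in the science map, while PFNET makes those areas more visible, groups categories into bunches, and shows the paths connecting the prominent categories as well as the overall topology. Factor analysis identified 35 factors in total, of which 16 were extracted through the scree test. The categories in the science map fall into three groups: medical and earth sciences, basic and experimental sciences, and social sciences.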

This study proposes a new methodology that allows for the generation of scientograms of major scientific domains, constructed on the basis of cocitation of Institute of Scientific Information categories, and pruned using PathfinderNetwork, with a layout determined by algorithms of the spring-embedder type (Kamada–Kawai), then corroborated structurally by factor analysis.

We present the complete scientogram of the world for the Year 2002.

This need arises from the general conviction that an image or graphic representation of a domain favors and facilitates its comprehension and analysis, regardless of who is on the receiving end of the depiction and whether a newcomer or an expert.

Science maps can be very useful for navigating around in scientific literature and for the representation of its spatial relations (Garfield, 1986). They are optimal means of representing the spatial distribution of the areas of research while also offering additional information through the possibility of contemplating these relationships (Small & Garfield, 1985).

From a general viewpoint, science maps reflect the relationships between and among disciplines; but the positioning of their tags clues us into semantic connections while also serving as an index to comprehend why certain nodes or fields are connected with others.

Moreover, these large-scale maps of science show which special fields are most productively involved in research—providing a glimpse of changes in the panorama—and which particular individuals, publications, institutions, regions, or countries are the most prominent ones (Garfield, 1994).

It is a tool in that it allows the generation of maps, and a method in that it facilitates the analysis of domains, by showing the structure and relations of the inherent elements represented. In a nutshell, scientography is a holistic tool for expressing the discourse of the scientific community it aspires to represent, reflecting the intellectual consensus of researchers on the basis of their own citations of scientific literature.

In Moya-Anegón et al. (2004), we ventured forth with a historic evolution of scientific maps from their origin to the present, and proposed ISI-JCR category cocitation for the representation of major scientific domains. Its utility was demonstrated by a visualization of the scientific domain of geographical Spain for the Year 2000.

Since then, other works related with the visualization of great scientific domains have appeared; however, all use journals as the unit of analysis, with the exception of a study based on the cocitation of categories (Moya-Anegón et al., 2005), comparatively focusing on three geographic domains (England, France, and Spain).

In contrast, Leydesdorff (2004a, 2004b) classified world science using the graph-analytical algorithm of biconnected components in combination with JCR 2001.

Boyack, Klavans, and Börner (2005) applied eight alternative measures of journal similarity to a dataset of 7,121 journals covering over 1 million documents in the combined Science Citation and Social Science Citation Indexes, to show the first global map of science using the force-directed graph layout tool VxOrd.

Samoylenko, Chao, Liu, and Chen (2006) proposed an approach through the construction of minimum spanning trees of scientific journals, using the Science Citation Index from 1994 to 2001.

In processing and depicting the scientific structure of great domains, we further developed a methodology that follows the flow of knowledge domains and their mapping as proposed by Börner, Chen, and Boyack (2003).

Because ISI assigns each journal to one or more subject categories, to designate a subject matter (i.e., ISI category) for each document, we also downloaded the Journal Citation Report (JCR; Thomson Corporation, 2005a), in both its Science and Social Sciences editions, for 2002.

The downloaded records were exported to a relational database that reflects the structured information of the documents. This new repository contained nearly 1 million (N = 901,493) source documents: articles, biographical items, book reviews, corrections, editorial materials, letters, meeting abstracts, news items, and reviews that had been published in 7,585 ISI journals (N = 5,876 + 1,709). These were classified in a total of 219 categories, altogether citing 25,682,754 published documents.

As informational units, they are, in themselves, sufficiently explicit to be used in the representation of all disciplines that make up science in general. These categories, in combination with the adequate techniques for the reduction of space and the representation of the information to construct scientograms of science or of major scientific domains, prove much more informative and user friendly for quick comprehension and handling by nonexpert users than those obtained by the cocitation of smaller units of cocitation.

For these reasons, we used the 219 categories of the JCR 2002 as units of measure, with the exception of “Multidisciplinary Sciences.” ... The maximum number of categories with which we worked, then, was 218.

In light of our previous experience (Moya-Anegón et al., 2004, 2005), we use cocitation as the similarity measure to quantify the relationship existing between each one of the JCR categories.

Therefore, after a number of trials, we arrived at the conclusion that using tools of Network Analysis, the best visualizations are those obtained through raw data cocitation as the unit of measure. Yet, it also was necessary to reduce the number of coincident cocitations to enhance pruning algorithm yield. Therefore, to those raw data values we added the standardized cocitation value. In this way, we could work with raw data cocitation while also differentiating the similarity values between categories with equal cocitation frequencies. The key was a simple modification of the equation for the standardization of the degree of citation proposed by Salton and Bergmark:

$$CM(ij) = Cc(ij) + \frac{Cc(ij)}{\sqrt{c(i) \cdot c(j)}}$$

where CM is cocitation measure, Cc is cocitation frequency, c is citation, and i and j are categories.
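A small Python sketch can show how such a measure separates category pairs with equal raw cocitation. It assumes the measure is the raw cocitation frequency plus the Salton-style normalized value Cc(ij)/sqrt(c(i)·c(j)), as the passage describes; the function name and all counts below are hypothetical.

```python
import math


def cocitation_measure(cc_ij: int, c_i: int, c_j: int) -> float:
    """Raw cocitation plus its normalized value (assumed form of CM)."""
    if c_i <= 0 or c_j <= 0:
        return float(cc_ij)  # no citations: nothing to normalize by
    return cc_ij + cc_ij / math.sqrt(c_i * c_j)


# Two hypothetical category pairs with the same raw cocitation frequency (12)
# but different citation totals.
cm_small = cocitation_measure(12, 100, 144)  # less-cited pair
cm_large = cocitation_measure(12, 400, 900)  # more heavily cited pair
print(cm_small, cm_large)
```

Because the normalized term is always below 1 for distinct categories, the ordering by raw cocitation is preserved and the added fraction only breaks ties: here the less-cited pair ends up slightly more similar.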

Over the history of the visualization of scientific information, very different techniques have been used to reduce n-dimensional space. Either alone or in conjunction with others, the most common are multidimensional scaling, clustering, factor analysis, self-organizing maps, and PathfinderNetworks (PFNET).

In our opinion, PFNET with pruning parameters r = ∞, and q = n − 1 is the prime option for eliminating less significant relationships while preserving and highlighting the most essential ones, and capturing the underlying intellectual structure in an economical way.

Although PFNET has been used in the fields of Bibliometrics, Informetrics, and Scientometrics since 1990 (Fowler & Dearholt, 1990), its introduction in citation was due to the hand of Chen (1998, 1999), who introduced a new form of organizing, visualizing, and accessing information. The end effect is the pruning of all paths except those with the single highest (or tied highest) cocitation counts between categories (White, 2001).
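The pruning rule for r = ∞ and q = n − 1 can be sketched as follows. With these parameters a path is only as strong as its weakest link, so a direct link survives only if no indirect path has a stronger weakest link. The sketch works directly on similarities (cocitation counts) rather than distances, which is an assumption made for brevity; the function name and the 4-category matrix are hypothetical.

```python
def pfnet_inf(sim):
    """Return the links kept by a PFNET(r=inf, q=n-1)-style pruning.

    `sim` is a symmetric matrix of non-negative similarities with a zero diagonal.
    """
    n = len(sim)
    # best[i][j]: strongest "weakest link" over all paths from i to j
    # (Floyd-Warshall on the max-min semiring).
    best = [row[:] for row in sim]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = min(best[i][k], best[k][j])
                if via_k > best[i][j]:
                    best[i][j] = via_k
    # Keep a direct link only if no indirect path beats it (ties are kept).
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if sim[i][j] > 0 and sim[i][j] >= best[i][j]
    }


# Hypothetical cocitation counts among four categories A, B, C, D (indices 0-3).
sim = [
    [0, 10, 3, 1],
    [10, 0, 8, 5],
    [3, 8, 0, 5],
    [1, 5, 5, 0],
]
kept = pfnet_inf(sim)
print(sorted(kept))
```

In this toy matrix the links A-C and A-D are pruned because the paths through B are stronger, while B-D and C-D are both kept because they tie.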

The spring embedder type is most widely used in the area of documentation, and specifically in domain visualization. Spring embedders begin by assigning coordinates to the nodes in such a way that the final graph will be pleasing to the eye (Eades, 1984). Two major extensions to the algorithm proposed by Eades (1984) have been developed by Kamada and Kawai (1989) and Fruchterman and Reingold (1991).

While Brandenburg, Himsolt, and Rohrer (1995) did not detect any single predominating algorithm, most of the scientific community goes with the Kamada–Kawai algorithm. The reasons upheld are its behavior in the case of local minima, its capacity to minimize differences with respect to theoretical distances in the entire graph, good computation times, and the fact that it subsumes multidimensional scaling when the technique of Kruskal and Wish (1978) is applied.
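To make "minimizing differences with respect to theoretical distances" concrete, the sketch below evaluates the spring energy usually associated with the Kamada-Kawai layout: each pair of nodes is joined by a spring whose ideal length is proportional to their graph distance. It only computes the energy of a given layout and does not perform the minimization; the function names, the constants `L` and `K`, and the 3-node path graph are illustrative assumptions.

```python
import math
from collections import deque


def graph_distances(n, edges):
    """All-pairs shortest-path lengths of an unweighted graph (BFS from each node)."""
    adj = {i: [] for i in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[math.inf] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[s][v] == math.inf:
                    dist[s][v] = dist[s][u] + 1
                    queue.append(v)
    return dist


def kk_energy(pos, dist, L=1.0, K=1.0):
    """Sum over node pairs of 0.5 * k_ij * (|p_i - p_j| - l_ij)^2."""
    energy = 0.0
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            d = dist[i][j]
            if d == math.inf:
                continue  # unconnected pairs are skipped in this sketch
            ideal = L * d          # ideal spring length l_ij
            stiffness = K / d**2   # spring strength k_ij
            actual = math.dist(pos[i], pos[j])
            energy += 0.5 * stiffness * (actual - ideal) ** 2
    return energy


# Hypothetical path graph 0 - 1 - 2 and two candidate layouts.
dist = graph_distances(3, [(0, 1), (1, 2)])
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # matches graph distances exactly
bent = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]      # nodes 0 and 2 drawn too close
print(kk_energy(straight, dist), kk_energy(bent, dist))
```

The straight layout reproduces every graph distance and so has zero energy; the bent layout is penalized only for the pair whose drawn distance departs from its graph distance.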

We can effortlessly see which are the most important nodes in terms of the number of their connections and, in turn, which points act as intermediaries with other lines, as hubs or forking points.

Whereas factor analysis is a clustering-oriented procedure, PFNET is topology oriented. Yet, they are extremely valuable as complements in the detection of the structure of a scientific domain.

Thus, factor analysis is responsible for identifying, delimiting, and denominating the great thematic areas reflected in the scientogram.

Meanwhile, PFNET is in charge of making the subject areas more visible, grouping their categories into bunches, and showing the paths that connect the different prominent categories, and finally, the overall topology of the domain.

Factor analysis identifies 35 factors in the cocitation matrix of 218 × 218 categories of world science 2002. Through the scree test we extracted 16, which we tagged using the previously explained method; these accumulate 70.2% of the variance (Table 1).

The number of categories included in at least one factor is 195. Twenty-three were not included in any factor (Table 2), and 25 belonged to two factors simultaneously (Table 5).

That is, a category or thematic area occupying a central position in the scientogram will have a more general or universal nature in the domain as a consequence of the number of sources it shares with the rest, contributing more to scientific development than those with a less central position.

The more peripheral the situation of a category or subject area, the more exclusive its nature, and the fewer the sources it will appear to share with other categories; accordingly, the lesser its contribution to the development of knowledge through scientific publications.

An intermediary position favors the interconnection of other categories or thematic areas. 

This broad interpretation of our scientograms not only explains the patterns of cocitation that characterize a domain but also foments an intuitive way for specialists and nonexperts to arrive at a practical explanation of the workings of PFNET (Chen & Carr, 1999).

From a macrostructural point of view, we can distinguish three major zones.

In the center is what we could call Medical and Earth Sciences, consisting of Biomedicine, Psychology, Etiology, Animal Biology & Ecology, Health Care & Service, Orthopedics, Earth & Space Science, and Agriculture & Soil Sciences.

To the right, we can see some other basic and experimental sciences: Materials Sciences & Physics, Applied; Engineering; Computer Science & Telecommunications; Nuclear Physics & Particles & Fields; and Chemistry.

To the left is the neighborhood of the social sciences, with Applied Mathematics, Business, Law, and Economy, and Humanities.

On one hand, it offers domain analysts the possibility of seeing the most essential connections between categories of a given domain.

On the other hand, it allows us to see how these categories are grouped in major thematic areas, and how they are interrelated in a logical order of explicit sequences.

Tuesday, April 14, 2015

Pudovkin, A.I., & Garfield, E. (2002). Algorithmic procedure for finding semantically related journals. Journal of the American Society for Information Science and Technology, 53(13), 1113–1119.

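This study uses citations, papers, and references as parameters to compute a relatedness factor between journals, and uses it to find the journals semantically most similar to a target journal. Traditional classification relies on subjective analysis, carried out for one particular reason or another; the journal categories in the ISI Journal Citation Reports (JCR), for instance, are produced by subjective, heuristic methods. In JCR, once the categories were established, new journals were assigned one at a time after a visual examination of their relevant citation data, and categories were subdivided as they grew. Individual journal assignments also draw on an unpublished algorithm, the Hayne-Coulson algorithm, which treats any designated group of journals as one macro-journal and produces its cited and citing journal data. In most cases this subjective analysis is sufficient, but in some research areas it is considered too crude, it is subject to the vagaries of time, and it does not let users quickly learn which journals are most closely related. Quantitative methods, made possible by the introduction of citation indexes, have been proposed to overcome these problems. JCR provides, for each journal, a set of its most closely related journals based on citation relationships: the journals it cites most and the journals that cite it most. Pudovkin & Garfield (2002) consider these extremely useful as a crude classification, but because journals differ in the number of papers they publish, they give only a superficial perception of relatedness between journals. They therefore propose a measure of relatedness between journals. Let $R_{i>j}$ denote the relatedness of journal i to journal j, defined as $R_{i>j} = H_{i>j} \times 10^6 / (Pap_j \times Ref_i)$, where $H_{i>j}$ is the number of citations from journal i to journal j in the current year, $Pap_j$ is the number of papers journal j published in that year, and $Ref_i$ is the total number of references in journal i's papers that year. Note that under this definition a journal's relatedness to itself may be lower than its relatedness to some other journals. To make the relatedness of two journals A and B symmetric, the study takes the larger of $R_{A>B}$ and $R_{B>A}$, that is, $R_{A\&B\max} = \max(R_{A>B}, R_{B>A})$. Using Genetics, a core journal in genetics and heredity, as the example, the results show that this relatedness factor, weighted by journal size, finds related journals better than unweighted citation counts. It identifies journals that are clearly genetic in content but were not included in the JCR "Genetics & Heredity" category, as well as journals that are in the category but are less related in content.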

Using citations, papers and references as parameters a relatedness factor (RF) is computed for a series of journals. Sorting these journals by the RF produces a list of journals most closely related to a specified starting journal.

The method appears to select a set of journals that are semantically most similar to the target journal.

Traditional classification relies on subjective analysis which for one reason or another proves inadequate and is subject to the vagaries of time.

Quantitative methods have been proposed for overcoming these problems. This was greatly facilitated with the introduction of citation indexes in the 1960's and the later introduction of the ISI Journal Citation Reports.

JCR reports inter-journal citation frequencies for thousands of journals. .... Journals are assigned to categories by subjective, heuristic methods.

One of the referees asked for a description of the procedures used by ISI in establishing journal categories for JCR. ... This method is “heuristic” in that the categories have been developed by manual methods started over 40 years ago. Once the categories were established, new journals were assigned one at a time. Each decision was based upon a visual examination of all relevant citation data. As categories grew, subdivisions were established. Among other tools used to make individual journal assignments, the Hayne-Coulson algorithm is used. The algorithm has never been published. It treats any designated group of journals as one macrojournal and produces a combined printout of cited and citing journal data.

In many fields these categories are sufficient but in many areas of research these “classifications” are crude and do not permit the user to quickly learn which journals are most closely related.

JCR provides, for each journal, a set of its most closely related journals based on citation relationships. These are the journals it cites most heavily (cited journals) and also the journals which cite it most often (citing journals). These are extremely useful and provide a crude classification, but unfortunately due to the variations in the sizes of journals one only obtains a superficial perception of the relatedness between two or more specific journals.

We have illustrated the procedure using one core journal in the field of genetics and heredity, the well-known Genetics, published by the Genetics Society of America.

Let journal relatedness of two journals, “i” and “j” be symbolized by $R_{i>j} = H_{i>j} \times 10^6 / (Pap_j \times Ref_i)$, where $H_{i>j}$ is the number of citations in the current year from journal “i” to journal “j” (to papers published in “j” in all years of “j”), and $Pap_j$ and $Ref_i$ are the number of papers published and references cited in the j-th and i-th journals in the current year.

If we consider a pair of journals, A and B, there may be two indexes: $R_{A>B}$ and $R_{B>A}$. These can be very different.

It is noteworthy that the citation relatedness of a journal to itself (that is “self-relatedness”) may be lower than its relatedness to some other journals.

Now it is suggested we use the larger of them, $R_{A\&B\max} = \max(R_{A>B}, R_{B>A})$, which we shall call the relatedness factor (RF).

An important feature of the suggested approach is the calculation of SPECIFIC citation relatedness, that is, the new indexes take into consideration the sizes of citing (through the number of references) and cited (through the number of published papers) journals.
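A minimal worked example of the relatedness factor follows, assuming the scaling constant in the definition is 10^6. The function names and all counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def relatedness(h_ij: int, pap_j: int, ref_i: int) -> float:
    """R_{i>j}: citations from i to j, scaled by the sizes of both journals."""
    return h_ij * 10**6 / (pap_j * ref_i)


def relatedness_factor(h_ab, h_ba, pap_a, pap_b, ref_a, ref_b) -> float:
    """RF: the larger of the two directed relatedness values."""
    r_ab = relatedness(h_ab, pap_b, ref_a)
    r_ba = relatedness(h_ba, pap_a, ref_b)
    return max(r_ab, r_ba)


# Hypothetical journals: A is small (100 papers, 10,000... see below), B is large.
# A cites B 200 times; B cites A only 50 times.
r_ab = relatedness(h_ij=200, pap_j=400, ref_i=20_000)  # A > B
r_ba = relatedness(h_ij=50, pap_j=100, ref_i=10_000)   # B > A
rf = relatedness_factor(200, 50, pap_a=100, pap_b=400, ref_a=20_000, ref_b=10_000)
print(r_ab, r_ba, rf)
```

Although A cites B four times as often as B cites A in raw counts, the size-weighted value is higher in the B-to-A direction, which is the kind of reversal that raw citation counts hide.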

The new algorithmic approach enables one to find thematically related journals out of a multitude of journals. ... Weighting citation data by journal size allows identifying journals that are similar in content better than unweighted raw citation data.

In the case of the starting journal Genetics the method identified those journals which are significantly genetic in content, but were not included in the “Genetics & Heredity” category of the JCR. ... Journals included in the “G & H” category are rather heterogeneous in content. Some are highly related to Genetics, while others, as for example journals on medical genetics are poorly related to its content.

JCR has become an established world wide resource but after two or more decades it needs to reexamine its methodology for categorizing journals so as to better serve the needs of the research and library community.

Thursday, April 9, 2015

Rafols, I., & Leydesdorff, L. (2009). Content‐based and algorithmic classifications of journals: Perspectives on the dynamics of scientific communication and indexer effects. Journal of the American Society for Information Science and Technology, 60(9), 1823-1835.

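This study compares two content-based journal classifications with two algorithm-based ones. The content-based classifications are the ISI Subject Categories and the field/subfield classification of Glänzel and Schubert (2003), referred to as SOOI. The algorithm-based classifications are the unfolding community detection method of Blondel et al. (2008) and the random walk matrix decomposition method of Rosvall and Bergstrom (2008). A content-based classification can assign a journal to several categories at once, whereas the algorithm-based classifications maximize the ratio of within-category citations to between-category citations. That is, when the citations among journals are arranged as a matrix and its rows and columns are suitably reordered, the values near the principal diagonal are large and those elsewhere are close to 0.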

Table 1 shows the statistics for each classification:


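Because the content-based classifications allow multiple assignments and the algorithm-based classifications aim at matrix decomposition, two patterns can be observed in Table 1. 1) The median number of journals per category is larger for the two content-based classifications than for the algorithm-based ones, which matches where the distributions of journals per category in Figure 1 cross 0.50. Figure 1 also shows that all four classifications are log-normally distributed: in each, a relatively small number of categories hold a large number of journals while many categories hold only a few. The algorithm-based classifications are more skewed than the content-based ones, so this imbalance is more pronounced in them. The top ten categories contain 57% of the journals under the random walk method and 50% under the unfolding method, but only 15% under ISI and 31% under SOOI.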


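2) In terms of how citations are distributed, the two content-based classifications have higher total citation counts than the algorithm-based ones, but under the random walk and unfolding methods a larger share of citations falls within categories, whereas under ISI and SOOI most citations fall between categories.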

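Next, the cosine similarity of citation patterns is used to compare how similar the categories within each classification are to one another. The medians for ISI and SOOI are 0.020 and 0.066, much higher than the 0.009 and 0.007 for the random walk and unfolding methods. The reason is again that content-based methods allow multiple assignments, so the boundaries between their categories are fuzzier, while algorithm-based methods separate categories more sharply. The categories of each classification are then drawn as a network according to their similarities. All four networks show two large groups, biomedicine on one side and physics and engineering on the other, connected through three groups: chemistry, a geography-environmental science-ecology group, and computer science. The social sciences group is somewhat detached; it connects to biomedicine through behavioral science/neuroscience and to physics/engineering through computer science and mathematics. In sum, the different science maps are similar, but they differ in the density of categories within each group.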
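The comparison of categories by cosine similarity can be illustrated with a short sketch. It treats each category's citation pattern as a vector of citation counts and takes the median over all category pairs; the three categories and their counts are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def cosine(u, v):
    """Cosine similarity between two citation-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical citation patterns: citations from each category to three targets.
patterns = {
    "cat1": [30, 5, 0],
    "cat2": [60, 10, 0],  # same pattern as cat1, twice the volume
    "cat3": [0, 0, 7],    # cites entirely different targets
}
pairwise = [cosine(patterns[a], patterns[b]) for a, b in combinations(patterns, 2)]
median_similarity = median(pairwise)
print(pairwise, median_similarity)
```

A low median, as in this toy case, means most category pairs have little overlap in what they cite, which is the situation the text describes for the algorithm-based classifications.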

In this study, we test the results of two recently available algorithms for the decomposition of large matrices against two content-based classifications of journals: the ISI Subject Categories and the field/subfield classification of Glänzel and Schubert (2003).

The content-based schemes allow for the attribution of more than a single category to a journal, whereas the algorithms maximize the ratio of within-category citations over between-category citations in the aggregated category-category citation matrix.
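The quantity the algorithms try to maximize can be sketched as a simple count. The example assumes each journal belongs to exactly one category, which holds for the algorithmic decompositions but not for the content-based schemes; the function name, journals, and citation counts are hypothetical.

```python
def within_between(cites, category):
    """Split journal-to-journal citations into within- and between-category totals."""
    within = between = 0
    for (src, dst), n in cites.items():
        if category[src] == category[dst]:
            within += n
        else:
            between += n
    return within, between


# Hypothetical aggregated citations: (citing journal, cited journal) -> count.
cites = {("a", "b"): 30, ("b", "a"): 20, ("c", "d"): 40, ("a", "c"): 5, ("d", "b"): 5}

good = {"a": 1, "b": 1, "c": 2, "d": 2}  # follows the dense citation blocks
poor = {"a": 1, "c": 1, "b": 2, "d": 2}  # cuts across them
print(within_between(cites, good), within_between(cites, poor))
```

The first partition keeps 90 of the 100 citations inside categories, the second only 10, so an algorithm maximizing the within/between ratio would prefer the first.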

At that time, Leydesdorff & Rafols (2009) were deeply involved in testing the ISI Subject Categories of these same journals in terms of their disciplinary organization. Using the JCR of the Science Citation Index (SCI), we found 14 major components using 172 subject categories, and 6,164 journals in 2006. Given our analytical objectives and the well-known differences in citation behaviour within the social sciences (Bensman, 2008), we decided to set aside the study of the (220 − 175 = ) 45 subject categories in the social sciences for a future study.

Our findings using the SCI indicated that the ISI Subject Categories can be used for statistical mapping purposes at the global level despite being imprecise in terms of the detailed attribution of journals to the categories.

In this study, we compare the results of these two algorithms with the full set of 220 Subject Categories of the ISI. In addition to these three decompositions, a fourth classification system of journals was proposed by Glänzel and Schubert (2003) and increasingly used for evaluation purposes by the Steunpunt Onderwijs and Onderzoek Indicatoren (SOOI) in Leuven, Belgium. These authors originally proposed 12 fields and 60 subfields for the SCI, and three fields and seven subfields for the Social Science Citation Index and the Arts and Humanities Citation Index. Later, one more subfield entitled “multidisciplinary sciences” was added.

Thus, because research topics are, on the one hand, thinly spread outside the core group and, on the other hand, the core groups are interwoven, one cannot expect that the aggregated journal-journal citation matrix matches one-to-one with substantive definitions of categories or that it can be decomposed in a single and unique way in relation to scientific specialties. The choice of an appropriate journal set can be considered as a local optimization problem (Leydesdorff, 2006).

Citation relations among journals are dense in discipline-specific clusters and are otherwise very sparse, to the extent of being virtually non-existent (Leydesdorff & Cozzens, 2003).

The grand matrix of aggregated journal-journal citations is so heavily structured that the mappings and analyses in terms of citation distributions have been amazingly robust despite differences in methodologies (e.g., Leydesdorff, 1987 and 2007; Tijssen, de Leeuw, & van Raan, 1987; Boyack, Klavans, & Börner, 2005; Moya-Anegón et al., 2007; Klavans & Boyack, 2009).

A decomposable matrix is a square matrix such that a rearrangement of rows and columns leaves a set of square sub-matrices on the principal diagonal and zeros everywhere else.

In the case of a nearly decomposable matrix, some zeros are replaced by relatively small nonzero numbers (Simon & Ando, 1961; Ando & Fisher, 1963). Near-decomposability is a general property of complex and evolving systems (Simon, 1973 and 2002).

The decomposition into nearly decomposable matrices has no analytical solution. However, algorithms can provide heuristic decompositions when there is no single unique correct answer.

Newman (2006a, 2006b) proposed using modularity for the decomposition of nearly decomposable matrices since modularity can be maximized as an objective function.

Blondel et al. (2008) used this function for relocating units iteratively in neighbouring clusters. Each decomposition can then be considered in terms of whether it increases the modularity.
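Modularity as an objective function can be computed directly for a given partition. The sketch below uses the standard undirected form, Q = Σ_c [L_c/m − (D_c/2m)²], where L_c is the edge weight inside community c, D_c the total degree of its nodes, and m the total edge weight. Treating the citation network as undirected is a simplifying assumption, and the two-triangle graph is hypothetical.

```python
def modularity(edges, community):
    """Newman modularity of a partition of an undirected weighted graph.

    `edges` is a list of (node_a, node_b, weight); `community` maps node -> label.
    """
    m = sum(w for _, _, w in edges)
    internal = {}  # edge weight inside each community
    degree = {}    # summed degree of each community's nodes
    for a, b, w in edges:
        ca, cb = community[a], community[b]
        degree[ca] = degree.get(ca, 0) + w
        degree[cb] = degree.get(cb, 0) + w
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + w
    return sum(internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Hypothetical graph: two triangles joined by a single edge.
edges = [
    (0, 1, 1), (1, 2, 1), (0, 2, 1),
    (3, 4, 1), (4, 5, 1), (3, 5, 1),
    (2, 3, 1),
]
two_groups = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
one_group = {node: "all" for node in range(6)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the graph along the bridge gives a positive modularity, while lumping everything into one community gives zero; an iterative method in the spirit of Blondel et al. keeps a relocation only if this value increases.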

Analogously, Rosvall and Bergstrom (2008) maximized the probabilistic entropy between clusters by estimating the fraction of time during which every node is visited in a random walk (cf. Theil, 1972; Leydesdorff, 1991).

The data were harvested from the CD-Rom version of the JCR of the SCI and Social Science Citation Index 2006, and then combined. ... The resulting set of 7,611 journals and their citation relations otherwise precisely corresponds to the online version of the JCRs. This large data matrix of 7,611 times 7,611 citing and cited journals was stored conveniently as a Pajek (.net) file and used for further processing.

The 7,611 journals are attributed by the ISI with 11,856 subject classifiers. This is 1.56 (±0.76) classifiers per journal. The ISI staff assign the 220 ISI Subject Categories on the basis of a number of criteria including the journal's title and its citation patterns (McVeigh, personal communication, March 9, 2006; Bensman & Leydesdorff, 2009).

According to the evaluation of Pudovkin and Garfield (2002), in many fields these categories are sufficient, but the authors added that “in many areas of research these ‘classifications’ are crude and do not permit the user to quickly learn which journals are most closely related” (p. 1113).

Leydesdorff and Rafols (2009) found that the ISI Subject Categories can be used for statistical purposes—the factor analysis for example can remove the noise—but not for the detailed evaluation. In the case of interdisciplinary fields, problems of imprecise or potentially erroneous classifications can be expected.

For the purpose of developing a new classification scheme of scientific journals contained in the SCIs, Glänzel and Schubert (2003) used three successive steps for their attribution. The authors iteratively distinguished sets cognitively on the basis of expert judgements, pragmatically to retain multiple assignments within reasonable limits, and scientometrically using unambiguous core journals for the classification. The scheme of 15 fields and 68 subfields is used extensively for research evaluations by the Steunpunt Onderwijs and Onderzoek Indicatoren (SOOI), a research unit at the Catholic University in Leuven, Belgium, headed by Glänzel.

The SOOI categories cover 8,985 journals. Using the full titles of the journals, 7,485 could be matched with the 7,611 journals under study in the JCR data for 2006 (which is 98.3%). These journals are attributed 10,840 classifiers at the subfield level. This is 1.45 (±0.66) categories per journal. One category (“Philosophy and Religion”) is missing because the Arts & Humanities Citation Index is not included in our data. Thus, we pursued the analysis with the 67 SOOI categories.

Using Rosvall and Bergstrom's (2008) algorithm with 2006 data, we obtained findings similar to those of these authors on August 11, 2008. Like the original authors using 6,128 journals in 2004, we found 88 clusters using 7,611 journals in 2006.

Lambiotte, one of the coauthors of Blondel et al. (2008), was so kind as to input the data into the unfolding algorithm and found the following results: 114 communities with a modularity value of 0.527708 and 14 communities with a modularity value of 0.60345. We use the 114 communities for the purposes of this comparison. These categories refer to 7,607 (= 7611 − 4) journals because four of the journals in the file were isolates.

The number of journals per category is log-normally distributed in each of the four classifications. In other words, they all have a relatively small number of categories with a large number of journals and many categories with only a few journals. However, as shown in Figure 1, the classifications based on the random walk and unfolding algorithms are more skewed than the content-based classifications.



Whereas the top-10 categories on the basis of a random walk comprise 57% of the journals (50% for unfolding), they cover only 15% in the ISI decomposition and 31% for the SOOI classification. In the case of skewed distributions, the characteristic number of journals per category can best be expressed by the median: the median is below 30 in the random walk or unfolding classifications, compared with 42 journals for the ISI classification and 141 for the SOOI classification (Table 1).
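As a rough illustration of the two summary statistics used above, the following sketch computes the share of journals held by the k largest categories and the median category size. The function name `top_share` and the category sizes are hypothetical and not taken from the paper; a journal assigned to several categories is assumed to be counted once in each of them.

```python
from statistics import median


def top_share(sizes, k=10):
    """Fraction of all category memberships that fall in the k largest categories."""
    ordered = sorted(sizes, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical numbers of journals per category (a skewed distribution).
sizes = [400, 250, 120, 60, 30, 20, 10, 5, 3, 2]

share = top_share(sizes, k=2)   # share held by the two largest categories
typical = median(sizes)         # median is robust to the few very large categories
print(share, typical)
```

With a skewed distribution the mean is pulled up by a few large categories, which is why the median is the more informative "characteristic" size.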


As presented in the last rows of Table 1, the total numbers of citations in the aggregated matrices based on the ISI or SOOI classifications are much higher because the same citation can be attributed to two or three categories. Thus, whereas random walk and unfolding lead to matrices with most citations within categories (on the diagonal), matrices based on ISI and SOOI classifications lead to matrices with most citations between categories (off-diagonal).
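The inflation of citation totals under multiple assignment can be seen in a small sketch. It assumes, as one plausible reading of the passage, that a journal-journal citation count is added once for every pair of (citing category, cited category); the function names and the toy data are hypothetical.

```python
def aggregate(citations, categories_of):
    """Sum journal-journal citation counts into a category-category table.

    citations: dict mapping (citing_journal, cited_journal) -> count
    categories_of: dict mapping journal -> list of assigned categories
    Assumption: a citation is counted once per category pair, so journals
    with several categories inflate the aggregated total.
    """
    table = {}
    for (citing, cited), count in citations.items():
        for a in categories_of[citing]:
            for b in categories_of[cited]:
                table[(a, b)] = table.get((a, b), 0) + count
    return table


def diagonal_share(table):
    """Fraction of aggregated citations that stay within one category."""
    total = sum(table.values())
    within = sum(v for (a, b), v in table.items() if a == b)
    return within / total


# Toy data: journal J2 is assigned to two categories.
categories_of = {"J1": ["A"], "J2": ["A", "B"], "J3": ["B"]}
citations = {("J1", "J2"): 10, ("J2", "J3"): 5, ("J3", "J3"): 4}

table = aggregate(citations, categories_of)
print(sum(citations.values()), sum(table.values()), diagonal_share(table))
```

In the toy data the 19 raw citations become 34 aggregated ones, and part of the added mass lands off the diagonal, which is the effect described for the ISI and SOOI matrices.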

Finally, to measure how similar the categories in the four decompositions are to each other, we computed the cosine similarities in the citation patterns between each pair of citing categories in the four aggregated category-category matrices (Salton & McGill, 1983; Ahlgren, Jarneving, & Rousseau, 2003).
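A minimal sketch of this comparison, assuming each row of the aggregated matrix is the citing profile of one category. The matrix values are invented for illustration, and the helper names are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def cosine(u, v):
    """Salton's cosine between two citation vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pairwise_cosines(matrix):
    """Cosine for every unordered pair of rows (citing categories)."""
    return [cosine(matrix[i], matrix[j])
            for i, j in combinations(range(len(matrix)), 2)]


# Rows are citing categories, columns are cited categories (toy values).
matrix = [
    [9, 1, 0],
    [1, 9, 0],
    [0, 0, 5],
]

sims = pairwise_cosines(matrix)
print(sims, median(sims))
```

A low median over all pairs means that most categories have little overlap in what they cite, which is the sense in which a decomposition is called "clean".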

We find again that all the distributions are highly skewed and that the random walk and unfolding algorithms exhibit a much lower median similarity value among categories. The lower medians indicate that the algorithmic decompositions produce a much “cleaner” cut between categories than the content-based classifications.
In conclusion, the analysis of the statistical properties of the different classifications teaches us that the random walk and the unfolding algorithms produce much more skewed distributions in terms of the number of journals per category, but these constructs are more specific than the content-based classification of the ISI and SOOI. The content-based sets are less divided because the boundaries among them are blurred by the multiple assignments.

In summary, although the correspondences among the main categories are sometimes as low as 50% of the journals, most of the mismatched journals appear to fall in areas within the close vicinity of the main categories. In other words, it seems that the various decompositions are roughly consistent but imprecise.

Maps of science for each decomposition were generated from the aggregated category-category citation matrices using the cosine as similarity measure.

The similarity matrices were visualized with Pajek (Batagelj & Mrvar, 1998) using Kamada and Kawai's (1989) algorithm.

The threshold value of similarity for edge visualization is pragmatically set at cosine > 0.01 for the algorithmic decompositions and cosine > 0.2 for the content-based decompositions to enhance the readability of the maps without affecting the representation of the structures in the data.

For the ISI decomposition, the 220 categories (Figure 3) were clustered into 18 macro-categories (Figure 4) obtained from the factor analysis (cf. Leydesdorff and Rafols, 2009).


The map of the SOOI classification was constructed with all its 67 subfields (Figure 5).


Taking advantage of the concentration of journals in a few categories, in the case of random walk and unfolding only the top 30 and 35 categories were used, respectively.


Indeed, the four maps correspond in displaying two main poles: a very large pole in the biomedical sciences and a second pole in the physical sciences and engineering. These two poles are connected via three bridging areas: chemistry, a geosciences-environment-ecology group, and the computer sciences. The social sciences are somewhat detached, linked via the behavioral sciences/neuroscience to the biomedical pole, and via the computer sciences and mathematics to the physics/engineering pole.

As noted above, although categories of different decompositions do not always match with one another, most “misplaced” journals are assigned into closely neighbouring categories. Therefore, the error in terms of categories is not large and is also unsystematic. The noise-to-signal ratio becomes much smaller when aggregated over the relations among categories.

As a second important observation that can be made on the basis of these maps, we wish to point to the differences in category density between the content-based and the algorithm-based maps.

In summary, we were surprised to find that the different science maps are similar except that they differ in the density of categories within groups.

The content-based classifications achieve a more balanced coverage of the disciplines at the expense of distinguishing categories that may be highly similar in terms of journals.

The first finding is that the algorithmic decompositions have very skewed and clean-cut distributions, with large clusters in a few scientific areas, whereas indexers maintain more even and overlapping distributions in the content-based classifications.

Second, the different classifications show a limited degree of agreement in terms of matching categories. In spite of this lack of agreement, however, the science maps obtained are surprisingly similar; this robustness is due to the fact that although categories do not match precisely, their relative positions in the network among the other categories are based on distributions that match sufficiently to produce corresponding maps at the aggregated level.

Monday, April 6, 2015

Chen, C.-M. (2008), Classification of scientific networks using aggregated journal-journal citation relations in the Journal Citation Reports. Journal of the American Society for Information Science and Technology, 59(14), 2296–2304. doi: 10.1002/asi.20935

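This study applies the affinity propagation method (Frey & Dueck, 2007) to aggregated journal-journal citation relations in order to classify the scientific networks that journals form through similar citation patterns. Many earlier studies have analyzed journal-journal citation data: Pudovkin and Garfield (2002) developed a relatedness factor based on citation data to find semantically related journals; Doreian and Fararo (1985) identified structurally equivalent journals in the network; and Leydesdorff and Cozzens (1993) used principal component analysis to obtain the eigenvectors of the scientific network. The citation data used in this study comprise the 2001 SCI (1,905 journals, 426,065 articles, and 13,798,138 citations) and the 2005 SSCI (1,578 journals, 66,051 articles, and 2,437,389 citations). The affinity propagation method uses s(i, j) = −dij to measure how well journal j is suited to be the representative journal of the category that contains journal i. The distance dij is derived from csij, the similarity between the citation patterns of the two journals.

Affinity propagation iteratively computes two quantities between journals to estimate how representative each journal is. The responsibility r(i, j) reflects how well suited journal j is to represent journal i, and the availability a(i, j) reflects how appropriate it would be for journal i to choose journal j as its representative.

For journal i, the largest value of a(i, j) + r(i, j) indicates which journal j represents it.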

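Based on the classification results, the specificity of a category can be expressed as the average distance from all member journals to the category's representative journal: the smaller the average distance, the higher the specificity. The relatedness of category members is expressed as the average distance among all journals in the category: the smaller it is, the closer the members are to one another.
Applied to the SSCI journals, the method yields 23 categories. Each category broadly matches an SSCI subject category, but the average distance among its members is smaller than in the corresponding SSCI category.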

Traditional classification methods (Glänzel & Schubert, 2003) are based on subjective analysis, whose output could vary from one person to another. In other words, these methods are more artistic than scientific.

On the other hand, a quantitative approach to classification is usually constructed based on a set of simple rules, which offers robust classification schemes that do not rely on human interference.

The aggregated journal-journal (J-J) citation data in JCR contain extensive information about interjournal citations, which could provide an understanding of the interaction among various scientific disciplines.

Based on JCR citation data, Pudovkin and Garfield (2002) have used an intuitive criterion (relatedness factor) for finding semantically related journals.

To avoid subjective analysis, various quantitative methods have been proposed to construct a robust classification system of scientific journals using JCR citation information.

A variety of techniques for analyzing J-J citation relationships have been reported in the literature to cluster scientific journals (Doreian & Fararo, 1985; Leydesdorff, 1986; Tijssen, De Leeuw, & Van Raan, 1987).

For example, by applying the notion of structure equivalence to analyze a small set of journals, Doreian and Fararo (1985) have delineated a set of blocks, which contain journals. These blocks have a very close correspondence to a categorization of the journals based on their aims and objectives.

More recently Leydesdorff and Cozzens (1993) have developed an optimization procedure that stabilizes approximated eigenvectors of the scientific network from principal component analysis as representations of clusters. This principal component analysis has been further extended to rotated component analysis (Leydesdorff, 2006; Leydesdorff & Cozzens, 1993), which enables one to focus on specific subsets with internal coherence.

An alternative method of cocitation clustering has been investigated in constructing a World Atlas of Sciences for ISI (Garfield, Malin, & Small, 1975; Leydesdorff, 1987; Small, 1999).

In this article, I propose a quantitative approach to classify the scientific network in terms of aggregated J-J citation relations of JCR using the affinity propagation method (Frey & Dueck, 2007).

The method used by ISI in establishing journal categories for JCR is a heuristic approach, in which the journal categories have been manually developed initially. The assignment of journals was based upon a visual examination of all relevant citation data.

As the number of journals in a category grew, subdivisions of the category were then established subjectively.

Although this is a useful approach, a more robust, convenient, and automatic classification scheme is desired.

The citation data analyzed include the SCI of 2001 and the SSCI of 2005, which are directly computed from the extraction of the CD version of the ISI database.

There are 2,195 journals of impact factor greater than 1 in the 2001 SCI. After removing 290 journals that did not publish any articles in 2001, there are 1,905 journals left in our data set, which contains 426,065 articles and 13,798,138 citations.

For the 2005 SSCI, there are 1,583 journals in the database, of which 1,578 journals have nonzero contents. The SSCI database contains 66,051 articles and 2,437,389 citations.

In principle, the dissimilarity between two journals can be visualized by the differences in their citation patterns. In other words, the citation pattern of each journal is represented by a normalized citation vector, and these vectors form a rescaled citation matrix. The dissimilarity (or similarity) in citation between two journals is related to the scalar product of their citation vectors.

For mapping or visualization, coefficients of similarity are converted into distances such that closely related journals are short distances apart and remotely related journals are long distances apart.

The affinity propagation method takes as input a collection of similarities between journals, where the similarity s(i, j) measures how well journal j is suited to be the representative of a journal category for journal i. Since the goal is to minimize squared error, we set s(i, j) = −dij.

There are two types of messages exchanged between journals, including the responsibility r(i, j), which is sent from journal i to candidate representative journal (RJ) j, and the availability a(i, j), which is sent from candidate representative journal j to journal i. Here the responsibility reflects the accumulated evidence for how well-suited journal j is to serve as the representative for journal i, and the availability shows the accumulated evidence for how appropriate it would be for journal i to choose journal j as its representative.

Taking into account other potential representative journals for journal i, the responsibility is computed iteratively as

where the initial value of a(i, j) is set to zero in the first iteration. Similarly, taking into account the support from other journals that journal j should be a representative, the availability is updated by gathering evidence from journals as to whether each candidate representative would make a good representative journal:

To reflect accumulated evidence that journal j is a representative based on the positive responsibilities sent to candidate representative j from other journals, the self-availability is updated as
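The formulas that the three preceding sentences lead into are not reproduced in these notes. In the standard formulation of Frey and Dueck (2007), rewritten here with the indices i and j used above (the primed indices i' and j' are introduced only for this illustration), the updates are:

```latex
\begin{align}
r(i,j) &\leftarrow s(i,j) - \max_{j' \neq j} \bigl\{ a(i,j') + s(i,j') \bigr\} \\
a(i,j) &\leftarrow \min\Bigl\{ 0,\; r(j,j) + \sum_{i' \notin \{i,j\}} \max\{0, r(i',j)\} \Bigr\}, \qquad i \neq j \\
a(j,j) &\leftarrow \sum_{i' \neq j} \max\{0, r(i',j)\}
\end{align}
```

The first line is the responsibility update, the second the availability update for i ≠ j, and the third the self-availability. Whether the article uses exactly this notation or adds damping cannot be confirmed from the excerpt.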

During the process of affinity propagation, the sum of availability and responsibility can be used to identify the representative journal of emerging journal categories. In other words, for any journal i, the value of j that maximizes a(i, j) + r(i, j) identifies that journal j is its representative.
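The message passing described above can be sketched in a few lines of plain Python. This is a minimal illustration of the standard Frey and Dueck (2007) scheme, not the author's implementation: the damping factor, the fixed number of iterations, the diagonal "preference" value, and the toy positions are all assumptions made for the example.

```python
def affinity_propagation(s, damping=0.5, iterations=200):
    """Return, for each item i, the index j that maximizes a(i, j) + r(i, j).

    s[i][j] is the similarity of i to j; s[j][j] is the preference of j
    to serve as a representative. Requires at least two items.
    """
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities, initially zero
    for _ in range(iterations):
        # r(i, j) = s(i, j) - max over j' != j of [a(i, j') + s(i, j')]
        for i in range(n):
            for j in range(n):
                best = max(a[i][k] + s[i][k] for k in range(n) if k != j)
                r[i][j] = damping * r[i][j] + (1 - damping) * (s[i][j] - best)
        # a(i, j) = min(0, r(j, j) + sum of positive r(i', j), i' not in {i, j})
        # a(j, j) = sum of positive r(i', j), i' != j
        for j in range(n):
            positive = [max(0.0, r[k][j]) for k in range(n)]
            for i in range(n):
                if i == j:
                    new = sum(positive) - positive[j]
                else:
                    new = min(0.0, r[j][j] + sum(positive)
                              - positive[i] - positive[j])
                a[i][j] = damping * a[i][j] + (1 - damping) * new
    return [max(range(n), key=lambda j: a[i][j] + r[i][j]) for i in range(n)]


# Toy example: six "journals" on a line, forming two well-separated groups.
positions = [0, 1, 2, 10, 11, 12]
preference = -20.0  # hypothetical self-similarity shared by all items
s = [[preference if i == j else -float((x - y) ** 2)
      for j, y in enumerate(positions)]
     for i, x in enumerate(positions)]

reps = affinity_propagation(s)
print(reps)
```

The number of categories is not passed in; it emerges from the similarities and the preference value, which matches the later remark that the method does not need the number of categories as an input.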

In our classifications, the level of specificity of a category can be found by looking at its value of DRJ (the average distance of members of a category to its representative journal), and relatedness of category members is implied by the value of DJ-J (the average J-J distance within a category).
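Both measures are simple averages over a distance matrix. The sketch below assumes that the representative journal itself is left out of the DRJ average and that DJ-J averages over all unordered pairs of members; the excerpt does not say which convention the article uses. Function names and distances are hypothetical.

```python
from itertools import combinations


def d_rj(members, representative, dist):
    """Average distance from the other members to the representative journal."""
    others = [m for m in members if m != representative]
    return sum(dist[m][representative] for m in others) / len(others)


def d_jj(members, dist):
    """Average distance over all unordered pairs of members."""
    pairs = list(combinations(members, 2))
    return sum(dist[x][y] for x, y in pairs) / len(pairs)


# Toy symmetric distance matrix for three journals; journal 1 is the representative.
dist = [
    [0, 2, 6],
    [2, 0, 4],
    [6, 4, 0],
]
members = [0, 1, 2]

specificity = d_rj(members, 1, dist)   # smaller means a more specific category
relatedness = d_jj(members, dist)      # smaller means more closely related members
print(specificity, relatedness)
```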

To demonstrate the applicability of the affinity propagation method in clustering a complete data set of journals, we first apply it to cluster journals in the 2005 SSCI database.

Here the cutoff parameter t is set to 0.0001, implying that the maximal value of DJ-J (DJ-Jmax) is 100. This choice of t is quite reasonable since the probability distribution (PD), or normalized histogram (bin size is 1), of DJ-J in the unclustered SSCI journal database is mostly between 0 and 30, as shown in Figure 1.



With a choice of DJ-Jmax = 100, the distance between unrelated journals is much larger than that between related journals. In other words, for any journal category, unrelated journals will not be located in the vicinity of its members (each journal is considered as a point in a high-dimensional space). Thus only correlated journals will be grouped together by the affinity propagation method.

However, if DJ-Jmax is too close to 30, the positions of unrelated journals are not well separated and the distortion to the journal positions due to the introduction of the cutoff would affect the clustering of journals.

For the predicted SSCI classification, only those J-J distances within the same category are considered in calculating its PD of DJ-J.

In Figure 1, there are two peaks observed from the statistical curves of PD in DJ-J, where the first peak shows the relatedness between journals within the database (or categories), while the second peak at DJ-J = 100 indicates the irrelevance between journals within the database (or categories).

For the predicted SSCI classification, clearly its first peak in the PD of DJ-J is much more prominent and the peak width is much more narrow than that of the unclustered SSCI database.

On the other hand, its second peak of irrelevance is much smaller than that of the unclustered database.

The probability distribution of the first peak is found to decrease exponentially with DJ-J, i.e., P = P0 exp[−(DJ-J − d0)/ Δ], where P0 is the peak value, d0 is the peak position, and Δ is the decay width. By fitting the statistical data, we find that d0 = 4 and Δ = 9.08 for the unclustered SSCI curve, while d0 = 2 and Δ = 1.72 for the clustered SSCI curve.
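The excerpt does not say how the fit was carried out. One common approach, shown here only as an assumption, is to take logarithms so that ln P is linear in DJ-J with slope −1/Δ, and to fit that line by least squares. The data below are synthetic, generated from the stated functional form with Δ = 1.72 and an arbitrary peak value.

```python
import math


def fit_decay_width(distances, probabilities):
    """Least-squares fit of ln P against D; returns the decay width -1/slope."""
    logs = [math.log(p) for p in probabilities]
    n = len(distances)
    mean_x = sum(distances) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distances, logs))
    sxx = sum((x - mean_x) ** 2 for x in distances)
    return -sxx / sxy


# Synthetic tail of a distribution: P = P0 * exp(-(D - d0) / width).
p0, d0, width = 0.3, 2, 1.72   # p0 is an arbitrary illustrative peak value
distances = list(range(d0, d0 + 7))
probabilities = [p0 * math.exp(-(d - d0) / width) for d in distances]

estimate = fit_decay_width(distances, probabilities)
print(estimate)
```

A smaller decay width means that the probability mass is concentrated at short distances, which is how the clustered curve (Δ = 1.72) differs from the unclustered one (Δ = 9.08).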

The entire journal set of SSCI is decomposed into 23 journal categories.

The relatedness of journals within a category can be seen as the average value of DJ-J within the category, and the specificity of a category is related to the average distance of category members to its RJ.

For any category, a smaller value of DRJ implies a higher level of specificity, and a smaller value of DJ-J implies that journals within a category are more closely related to each other.

In general most categories in our classification scheme have a corresponding category in the ISI classification scheme, and their value of DJ-J seems to be smaller than that of their counterpart in the ISI classification scheme.

When a larger value of the cutoff parameter is used, the maximal distance of DJ-J becomes smaller. ... Since the high-dimensional J-J distance space is now approximated by a high-dimensional sphere of smaller radius, the resolution in clustering journals is higher in this case. Thus the SCI database is expected to be decomposed into more clusters for t = 10^−3, compared to the case of t = 10^−4. ... Therefore, from comparing clustering results with different values of the cutoff parameter, the relationship among various disciplines can be revealed.

Our results demonstrate that the affinity propagation method can provide a reasonable classification scheme for either a complete database or an incomplete database. This method does not need the number of categories or their size as an input.

Distance between journals is calculated from the similarity of their annual citation patterns with a cutoff parameter to restrain the maximal distance.

Different values of the cutoff parameter lead to different levels of resolution in the classification of journal network. A more coarse-grained classification is obtained when a smaller value of the cutoff parameter (or a larger maximal J-J distance) is used.

We note that, unlike the ISI classification scheme, which allows overlap in the content of journal categories by subjective decisions, each journal uniquely belongs to a category in our classification scheme.

Thursday, April 2, 2015

Wang, F., & Wolfram, D. (2014). Assessment of journal similarity based on citing discipline analysis. Journal of the Association for Information Science and Technology.

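Using the Web of Science subject classification, the frequency distribution of the disciplines of citing journals provides a signature with which cited journals can be compared for similarity. Compared with the cocitation approach, this comparison has lower dimensionality and requires much less computation. The study compares 40 high-impact journals in the Web of Science category Information Science & Library Science, and uses multidimensional scaling and hierarchical cluster analysis to compare the similarity estimates of the proposed method with those of cocitation. The publication years analyzed span 1987 to 2011, in 5-year periods. Among the journals, Scientometrics (SCI) and the Journal of the Association for Information Science and Technology (JASIST) are cited from the most diverse set of disciplines, because JASIST has broad coverage and other fields are interested in metrics research. The maps and clustering results show that some journals are not close to the others. The similarity estimates from citing discipline analysis resemble those from cocitation analysis: in each period, at the three-cluster level, both methods mostly yield one LIS cluster, one MIS cluster, and one more dispersed, peripheral cluster, although the cluster memberships differ somewhat. Wang and Wolfram (2014) therefore suggest that citing discipline analysis can complement cocitation analysis.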

The frequency distribution of disciplines by citing articles provides a signature for a cited journal that permits it to be compared with other journals using similarity comparison techniques.

As an initial exploration, citing discipline data for 40 high-impact-factor journals assigned to the “information science and library science” category of the Web of Science were compared across 5 time periods. Similarity relationships were determined using multidimensional scaling and hierarchical cluster analysis to compare the outcomes produced by the proposed citing discipline and established cocitation methods.

The maps and clustering outcomes reveal that a number of journals in allied areas of the information science and library science category may not be very closely related to each other or may not be appropriately situated in the category studied.

The citing discipline similarity data resulted in similar outcomes with the cocitation data but with some notable differences. Because the citing discipline method relies on a citing perspective different from cocitations, it may provide a complementary way to compare journal similarity that is less labor intensive than cocitation analysis.

The application of visualization techniques to groups of bibliographic entities (publications, journals, or authors) provides a method for assessing the closeness of relationships among entities of interest. ... On a fundamental level, these investigations allow us to understand better the structure of disciplines based on the production of scholarship and how this changes over time (e.g., White & McCain, 1998). On a more specific level, findings can help to assess the impact of entities of interest or to situate disciplines or specializations within a larger context.

Leydesdorff and Cozzens (1993) studied how to delineate and attribute journals to specialties based on journal−journal citations and their changes over time. They demonstrated how the data could be used to construct macrojournals, consisting of aggregations of journals around a central journal.

Pudovkin and Garfield (2002) developed a journal relatedness factor based on citing and cited journals. The method was proposed to help identify thematically related journals.

Similarly, Glänzel and Schubert (2003) proposed the categorization of journals using a three-step process involving predefined categories, journal classification, and article classification for articles in journals with ambiguous subject assignments based on references.

Rafols and Leydesdorff (2009) compared the outcomes of two algorithms for the decomposition of large matrices against Web of Science (WoS) subject categories and Glänzel and Schubert’s categorization. The four methods resulted in similar map outcomes on a large scale.

Leydesdorff and Schank (2008) visualized and animated the disciplinary ties of three seed journals over time to demonstrate relationships among journals and their interdisciplinarity.

Boyack and Klavans (2010) compared results from cocitation analysis, bibliographic coupling, direct citation, and a hybrid approach for accuracy of outcomes in representing research fronts for a large corpus of biomedical literature. They noted that bibliographic coupling performed the best in representing the research fronts.

White (2000) proposed the use of citers to identify characteristics of a given author’s research such as an author’s citation identity, which consists of all the authors a given author cites. White also introduced the idea of citation image-makers, consisting of the authors who refer to a cited author. The citation image-makers approach may also be applied to journals, where citing authors constitute the citation image-makers of the journal.

Yan, Ding, Milojević, and Sugimoto (2012) explored community structures in IR research by combining topic modeling and community detection with IR literature to reveal the changing landscape of IR research.

To reduce the dimensionality of the similarity comparison, disciplinary identifiers for citing articles/journals may be used to reduce the number of comparisons that have to be made.

For the purposes of this study, WoS research areas are used. In this paper the research areas are referred to as disciplinary assignments.

This research is guided by several questions.
1. Does the frequency distribution of disciplines of citing journals permit comparison of journal similarities in a meaningful way?
2. Are the results of such a comparison similar or complementary to the better-established approach of cocitation analysis?
3. Do the similarities among journals within the same disciplinary categorization change over time as reflected in the changes in the frequency distribution of citing journal disciplines?
4. Can these similarities (or distances) provide decision support for whether journals should be grouped together in citation index services such as Thomson Reuters’ Journal Citation Reports?

Forty high-impact journals included in the Thomson Reuters’ 2011 Journal Citation Reports grouped in the category ISLS were selected for the study.  ... In addition to many of the journals rated highly in library and information science (LIS), as evidenced by a perception study of LIS deans and Association of Research Library directors conducted by Nisonger and Davis (2005), this category includes journals in allied areas such as management information systems (MIS), geographic information systems, and medical informatics.

Among the 20 highest-impact journals listed in the ISLS category, only 3 are included in the top 20 journals rated by LIS deans based on their familiarity with these journals. The majority of the remaining journals in the top 20 based on impact factor could be argued to be from allied areas given their additional classification in other WoS research areas and the lack of familiarity or resulting lower prestige as determined by LIS deans.

Citing article/journal data were collected from 1987 to 2011 and were divided into 5-year intervals.

For each journal, all articles, review articles, and conference proceeding articles were selected; all other publication types such as cited material were excluded. For each time period, the “create citation report” in the WoS was selected to identify all citing articles. The number associated with “citing articles” was then selected to retrieve the list of citing articles. The WoS “analyze results” feature was next selected for the list of citing articles. On the results analysis page, “research areas” were selected as the ranking field to provide the tabulated list of citing disciplines. The ranked list of citing disciplines was then copied into an MS Excel spreadsheet.

The list of research areas and their frequencies represent the journal’s citing discipline profile for each time period.

Salton’s cosine similarity measures were calculated for each pair of journals to produce a symmetric matrix of journal similarity values ranging between 0 and 1 (Ahlgren, Jarneving, & Rousseau, 2003, 2004; Egghe & Leydesdorff, 2009; Leydesdorff, 2006, 2007) for each time period.
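A small sketch of this step, assuming each journal's citing discipline profile is stored as a mapping from research area to citation frequency. The journal names and frequencies are invented; only the cosine formula itself is standard.

```python
import math


def cosine(p, q):
    """Salton's cosine between two sparse discipline-frequency profiles."""
    dot = sum(p[k] * q[k] for k in p if k in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


# Hypothetical citing discipline profiles for one time period.
profiles = {
    "Journal A": {"Computer Science": 30,
                  "Information Science & Library Science": 40},
    "Journal B": {"Computer Science": 40,
                  "Information Science & Library Science": 30},
    "Journal C": {"Business & Economics": 50},
}

names = sorted(profiles)
matrix = [[cosine(profiles[a], profiles[b]) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

Because the profiles are indexed by research area rather than by citing journal or author, each vector has at most a few hundred entries, which is where the reduction in dimensionality comes from.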

To provide a baseline comparison, a cocitation analysis was also conducted with the same journals.

Multidimensional scaling (MDS) PROXSCAL analysis and hierarchical cluster analysis in SPSS v.20 were applied to the symmetric similarity matrices.

The PROXSCAL algorithm was used instead of ALSCAL for the MDS procedure because it allows similarity or dissimilarity matrices to be used and has been shown to provide superior results for cocitation studies (Leydesdorff & Vaughan, 2006).

For hierarchical clustering, Ward’s method was used. Minkowski distance and squared Euclidean distance were each explored and produced the same outcomes at the three-cluster level. Clustering outcomes were superimposed onto the MDS maps.

Library Resources and Technical Services (LRTS) consistently attracted citations from the fewest discipline areas, indicating a narrower interdisciplinary focus. In fact, the number of citing article disciplines has declined over the past decade for this journal, possibly indicating even narrower interdisciplinary impact.

Scientometrics (SCI) and the Journal of the Association for Information Science and Technology (JASIST), on the other hand, at different time periods each attract the most disciplinarily diverse citations. These outcomes are not unexpected given the broad coverage of JASIST and the interest in metrics research by other disciplines.



The MDS map of the journals using the proposed citing discipline approach for the first period appears in Figure 1. Among the journals, 14 of the 22 are situated in close proximity. A secondary group with three journals is situated on the periphery.

In combination with the cluster-analysis groupings, one can see at the three-cluster level that the tightly clustered journals are core to LIS.

A more widely dispersed second cluster of five journals consists of LIS and allied area journals in MIS. ... It is interesting to note that Government Information Quarterly (GIQ), International Journal of Geographical Information Science (IJGIS), and Journal of the Medical Library Association (JMLA; at the time still the Bulletin of the Medical Library Association) are situated more closely to and are clustered with the journals associated with the MIS area.

A peripheral “Other” cluster contains three journals. ... Telecommunications Policy (TP), Journal of Scholarly Publishing (JSP), and Social Science Information (SSI) are situated on the periphery of the map for the first and second time periods, indicating little similarity with the other journals in the citing discipline distributions.


The equivalent cocitation analysis map (Figure 2) at the three-cluster level produces similar outcomes, but with several notable differences.

The International Journal of Information Management (IJIM) is situated more closely to LIS journals than to those in MIS.

Two of the MIS journals are situated in their own cluster along with GIQ and TP, equivalent to the “other” category. IJGIS appears at the periphery of the map in the MIS category.

The remaining journals are subdivided into two clusters that may be characterized broadly as information science and library science, respectively, with JSP and SSI being a part of these clusters.

There is a 63.6% overlap (14 of 22 journals) in the cluster assignments, indicating that there is a moderate level of agreement between the two approaches.
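The overlap figure is a simple proportion of journals that receive the same cluster label under both methods. The sketch below assumes the cluster labels of the two solutions have already been aligned by hand (for example "LIS", "MIS", "Other"); the journal names and assignments are hypothetical.

```python
def agreement(first, second):
    """Count journals with the same label in both assignments.

    Returns (matching, compared, proportion) over journals present in both.
    """
    shared = [j for j in first if j in second]
    same = sum(1 for j in shared if first[j] == second[j])
    return same, len(shared), same / len(shared)


# Hypothetical cluster assignments from the two methods.
citing_discipline = {"J1": "LIS", "J2": "LIS", "J3": "MIS", "J4": "Other"}
cocitation = {"J1": "LIS", "J2": "MIS", "J3": "MIS", "J4": "Other"}

same, total, share = agreement(citing_discipline, cocitation)
print(same, total, share)

# The proportion reported in the text: 14 matching journals out of 22.
reported = round(14 / 22 * 100, 1)
print(reported)
```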

For the second time period, the three clusters for the citing discipline-based analysis consisted of a group of 12 journals representing the LIS area, an emerging cluster of journals focusing on the MIS area and several journals in allied areas, and an “other” group consisting of JSP, SSI, TP, and IJGIS.

The cocitation analysis outcomes for the second time period reveal a similar mapping arrangement and clustering of journals, with 15 journals corresponding to the LIS category, eight representing a group with an MIS focus, and an “other” category consisting of journals in allied areas.

The citing discipline MDS map and cluster analysis results for the three-cluster level are similar to the first two time periods, but with more distinctive LIS, MIS, and other clusters as the number of journals in each cluster has grown.

The cocitation analysis map and resulting clusters at the three-cluster level consist of primarily LIS journals, those in MIS, and the other category similar in composition to the citing discipline outcome. ... The cluster assignment match at the three-cluster level between the citing discipline and cocitation analysis methods is 88% (29 of 33 journals), indicating a high level of agreement.

The citing discipline MDS map for the fourth time period is similar to that for the previous time period.

Of note with the cocitation cluster analysis outcome for the fourth time period is a much larger other category that includes a number of journals categorized as LIS by the citing discipline method. GIQ and INFSOC are situated between the LIS and MIS groups, although they are placed in the other group.

The citing discipline and cocitation maps for the fifth time period appear in Figures 5 and 6, respectively. The outcomes for the citing discipline approach are quite similar to those for the third and fourth time periods, with well-defined LIS and MIS categories and a more scattered other category on the periphery.

The three clusters based on the cocitation analysis data again reflect the LIS, MIS, and other groupings. There are fewer members in the other category than for the fourth time period.

Much in the same way that dimensionality reduction used in certain statistical methods and IR allows for simplified comparisons, the use of the WoS research areas by citing journals and their frequency instead of citing authors or citing journals provides a less computationally intensive way to assess journal similarity by reducing the dimensionality of the comparisons and the computational overhead.


Wolfram, D., & Zhao, Y. (2014). A comparison of journal similarity across six disciplines using citing discipline analysis. Journal of Informetrics, 8(4), 840-853.


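Relationships among researchers, research groups, institutions, regions/nations, disciplines, and publications in scholarly communication can be revealed and better understood by studying them at different scales. Links between the data can be analyzed in the form of direct citations, co-citations, co-authorship, co-occurrence of words or subjects, or latent topics. Journal similarity has often been studied using co-citation; fields in which journal co-citation studies have been carried out include economics (McCain, 1991), information retrieval (Ding, Chowdhury & Foo, 2000), information systems (Marion, Wilson & Davis, 2005), medical informatics (Morris & McCain, 1998), neural networks (McCain, 1998), and semiconductor research (Tsay, Xu & Wu, 2003). Co-citation studies nevertheless face several difficulties. First, when multiple disciplines are involved, the journal-journal co-citation matrix can be quite sparse (Boyack, Klavans, & Börner, 2005). Second, co-citation data are labor-intensive to extract and are not easily available, even from database sources such as Web of Science, without downloading all the references of a corpus of articles. Finally, co-citation analysis mostly uses only the co-citation counts and does not consider any attributes of the citing source.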

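An important line of informetrics research uses citation data to identify the disciplinary or specialization affiliations of journals; misclassification affects the ranking of journals within a field. Glänzel and Schubert (2003) developed a three-step journal classification process in which articles in journals can be assigned subjects based on their references. Rafols and Leydesdorff (2009) compared two algorithms for decomposing large matrices against the Web of Science Subject Categories and the Glänzel and Schubert (2003) classification. Leydesdorff and Rafols (2009) also studied the relationships among the 170+ Web of Science Subject Categories using a citation matrix of subject category citation frequencies. Leydesdorff and Schank (2008) used visualization and animation to show the relationships among journals and their interdisciplinarity.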

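Earlier studies have used the journal citation image, that is, the list of all journals that cite a target journal, as a feature for assessing the similarity between journals. Computationally, combining the journal citation image with the frequency distribution of citations gives a signature for the target journal. However, influential and reputable journals may have a very large number of citing journals, which makes signatures based on the journal citation image expensive to compute. Wang and Wolfram (2014) proposed citing discipline analysis to assess the similarity of cited journals: to reduce the computation, the signature of a journal is built from the frequency distribution of its citing journals over the Web of Science research areas, rather than from the citing journals directly. Wang and Wolfram (2014) applied citing discipline analysis to 40 journals in the JCR category Information Science & Library Science. In the multidimensional scaling and cluster analysis results, they found that some journals were not close to the others; some journals that were also classified in other disciplines were not close to the Information Science & Library Science journals; and when such journals also belong to larger related fields, their higher impact factors lower the ranks of the other journals. Because Wang and Wolfram (2014) examined journals in only one discipline, it remained unclear whether the same holds for journals from multiple disciplines.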

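The present study likewise uses citing discipline analysis to estimate the similarity between journals. The data are 120 journals from six disciplines. Five of the disciplines are relatively close to one another: Communication, Computer Science-Information Systems, Education & Educational Research, Information Science & Library Science, and Management. The sixth, Geology, is more distant. The journals were studied over the period 1987 to 2012, divided into three time periods: 1987–1995, 1996–2004, and 2005–2012. Journal similarity was computed with the cosine measure, and the results were analyzed with multidimensional scaling, hierarchical cluster analysis, and Principal Component Analysis.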

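In the first period, the journals from the six disciplines fall into five clusters. Because there were few Computer Science-Information Systems journals in this period, they form a single cluster with Information Science & Library Science; the Geology cluster lies far from the other clusters. Journal of Advertising Research (JAR), which is classified under Communication, is closer to the Management journals in this period. The MIS journals and Telecommunications Policy (TP) lie between Information Science & Library Science and Management, and Social Science Computer Review (SSCR) lies among Information Science & Library Science, Communication, and Management. Science Communication (SCOMM), although classified under Communication, falls into the Information Science & Library Science cluster in this study.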


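In the second period, Computer Science-Information Systems has separated from Information Science & Library Science, and the 93 journals of this period form six clusters. Some journals classified under Information Science & Library Science are grouped with other disciplines: The Journal of the American Medical Informatics Association (JAMIA) and The International Journal of Geographical Information Science (IJGIS) cluster with Computer Science-Information Systems, and the latter is also quite close to Geology. Decision Support Systems (DSS), by contrast, is classified under Computer Science-Information Systems but falls into the Information Science & Library Science cluster. Journal of Health Communication (JHC) is classified under both Communication and Information Science & Library Science, but in this period it lies closer to the Communication journals and clusters with them. In addition, the Academy of Management Learning & Education (AMLE) and JAR, classified under Education & Educational Research and Communication respectively, both fall into the Management cluster.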

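In the third period, the six clusters correspond to the six disciplines, but some journals classified under Communication lie closer to the Management journals. International Journal of Computer-supported Collaborative Learning (IJCSCL), which is classified under both Education & Educational Research and Information Science & Library Science, clearly belongs to the Education & Educational Research cluster.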

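Some journals assigned to a single discipline turn out to be closer to another discipline, while some journals assigned to several disciplines are close to only one of them. The method proposed in this study can therefore complement traditional methods such as journal co-citation analysis when measuring the proximity of journals. More and more journals cross disciplinary boundaries, for several possible reasons: 1) journals are becoming more interdisciplinary in their coverage and so attract more citations from publications in other disciplines; 2) federated search tools are increasingly common, which makes journals easier for authors in other disciplines to discover; 3) as more journals are added to the analysis, the citing discipline signatures become less distinctive and journals appear more similar to one another.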

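To better understand the interdisciplinary journals, the study examined the citing discipline signatures with Principal Component Analysis. Among the journals associated with Information Science & Library Science in the first period, five load on two components: the three Management journals ISJ, ISR, and MISQ, together with SCOMM and DSS, which are classified under Communication and Computer Science-Information Systems respectively. ARIST, EJIS, IM, JASIST, JIS, JIT, JSIS, and MISQ, which are classified under Information Science & Library Science and also under Computer Science-Information Systems, do not load on the Computer Science-Information Systems component. JAR and AMLE are classified under Communication and Education & Educational Research respectively, but in the periods in which they appear they load only on the Management component.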

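In the second period, many MIS journals (IM, ISJ, ISR, JIT, JMIS, JSIS, MISQ) load on both the Management and the Information Science & Library Science components. SSCR, although classified only under Information Science & Library Science, loads on both the Information Science & Library Science and the Communication components. IJGIS, although classified under Information Science & Library Science, does not load on any component and is evidently peripheral to the six disciplines.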

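In the third period, still more MIS journals load on both the Management and the Information Science & Library Science components. Besides IJGIS, Journal of Chemical Information and Modeling (JCIM) and Journal of Cheminformatics (JCHEM) also do not load on any component.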

A similarity comparison is made between 120 journals from five allied Web of Science disciplines (Communication, Computer Science-Information Systems, Education & Educational Research, Information Science & Library Science, Management) and a more distant discipline (Geology) across three time periods using a novel method called citing discipline analysis that relies on the frequency distribution of Web of Science Research Areas for citing articles.

Similarities among journals are evaluated using multidimensional scaling with hierarchical cluster analysis and Principal Component Analysis.

The resulting visualizations and groupings reveal clusters that align with the discipline assignments for the journals for four of the six disciplines, but also greater overlaps among some journals for two of the disciplines or categorizations that do not necessarily align with their assigned disciplines.

Some journals categorized into a single given discipline were found to be more closely aligned with other disciplines and some journals assigned to multiple disciplines more closely aligned with only one of the assigned disciplines.

The proposed method offers a complementary way to more traditional methods such as journal co-citation analysis to compare journal similarity using data that are readily available through Web of Science.

Aspects of scholarly communication may be investigated from different levels of granularity to reveal and better understand relationships between researchers, research groups, institutions, regions/nations, specializations/disciplines, publications or publication outlets.

Connections that exist between sources of interest may take the form of direct citations, co-citations, co-authorship, co-occurrence of words or subjects, or more recently, latent topics.

Journal similarity comparison has been frequently studied using co-citations. Journal co-citation studies have been carried out on a number of fields including economics (McCain, 1991), information retrieval (Ding, Chowdhury & Foo, 2000), information systems (Marion, Wilson & Davis, 2005), medical informatics (Morris & McCain, 1998), neural networks (McCain, 1998), and semiconductor research (Tsay, Xu & Wu, 2003). 

One reason co-citation studies tend to focus on individual fields is that the journal–journal co-citation matrix that emerges when multiple disciplines are employed can be quite sparse (Boyack, Klavans, & Börner, 2005).

Co-citation data can also be labor-intensive to extract and are not easily available through citation database sources such as Thomson Reuters Web of Science (WoS) without downloading all references from a corpus of articles.

Citation-based data may also be used to identify disciplinary or specialization affiliations for journals. This is particularly important for informetrics studies, where the misclassification of journals may affect the ranking of journals within a given field.

Pudovkin and Garfield (2002) developed a journal relatedness factor based on citing and cited journals. The goal of their proposed method was to help identify thematically related journals.

Similarly, Glänzel and Schubert (2003) developed a three-step process for the categorization of journals that involved pre-defined categories, journal classification and article classification for articles in journals with ambiguous subject assignments based on references.

More recently, Rafols and Leydesdorff (2009) compared the outcomes of two algorithms for the decomposition of large matrices against Web of Science Subject Categories and Glänzel and Schubert’s categorization. The four methods they used resulted in similar map outcomes on a large scale. Leydesdorff and Rafols (2009) also investigated the relationships among 170+ Web of Science Subject Categories using a citation matrix consisting of the subject category citation frequencies. They concluded that a classification scheme could be developed using analytical arguments.

Similarly, Leydesdorff and Schank (2008) visualized and animated the disciplinary ties of three seed journals over time to demonstrate relationships among journals and their interdisciplinarity.

Co-citation analysis relies on citing articles to identify the strength of relationships between the units of interest, whether authors, papers or journals; however, it does not consider any attributes of the source of the citations – only that the citations or co-citations exist. Authors such as White (2001) and Ajiferuke, Lu, and Wolfram (2010) have called for a shift in the focus of citation-based research away from citation counts received by an author of interest to the origin of the citation and its characteristics to assess author impact from a different perspective.

This research investigates the use of data derived from citing journals to assess the similarity of cited journals.

The journal citation image of a target journal, which is determined by the list of journals that cite the target journal, provides an indicator of the reach of a journal. When combined with the frequencies of citation by the citing journals, the frequency distribution of citations provided by the citing journals creates a “signature” for each cited journal. These signatures may be compared using various analytical methods.

One possible challenge associated with using the citing journals themselves to create a signature for a cited journal is the potentially high number of citing journals that an influential and prolific journal might attract.

Wang and Wolfram (forthcoming) proposed a method to reduce the computational overhead associated with the citing journal data. Their method of citing discipline analysis uses the subjects/disciplines assigned to the citing journal and the resulting citation frequencies of the citing disciplines to constitute the cited journal’s signature.
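The reduction described above can be sketched as collapsing a list of citing journals into a much shorter list of citing disciplines. The snippet below is an illustration under stated assumptions, not the authors' implementation: the journal names, counts, and the `journal_areas` mapping are hypothetical, and a citing journal assigned to several areas is assumed to contribute its full count to each of them.

```python
from collections import Counter


def discipline_signature(citing_counts, journal_areas):
    """Aggregate citing-journal counts into citing-discipline counts."""
    signature = Counter()
    for journal, n in citing_counts.items():
        # Assumption: a journal in several areas counts fully toward each one.
        for area in journal_areas.get(journal, ["Unclassified"]):
            signature[area] += n
    return signature


# Hypothetical citing journals for one cited journal.
citing_counts = {"Journal A": 30, "Journal B": 12, "Journal C": 5, "Journal D": 8}
journal_areas = {
    "Journal A": ["Information Science & Library Science"],
    "Journal B": ["Information Science & Library Science", "Computer Science"],
    "Journal C": ["Business & Economics"],
    "Journal D": ["Computer Science"],
}

sig = discipline_signature(citing_counts, journal_areas)
for area, n in sig.most_common():
    print(area, n)
```

With real data the list of citing journals can run to thousands of entries while the list of research areas stays comparatively short, which is where the computational saving comes from.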

Wang and Wolfram (forthcoming) employed citing discipline analysis to explore journal similarity among 40 high impact journals in Information Science and Library Science (ISLS) as classified in Journal Citation Reports (JCR). They found that some of the journals classified into the ISLS category did not map in close proximity to one another based on multidimensional scaling and cluster analysis. A number of the journals included were also classified into allied fields, but did not cluster or appear in close proximity to a number of journals only classified in ISLS.

The authors noted that how journals are classified can impact journal rankings within a given field, where journals from related, but larger, fields may have higher journal impact factors (IF), which can reduce the rank of journals that are directly in the field. They observed that many of the high impact journals were from allied areas to ISLS.

One limitation of their exploratory study was the focus on a single discipline. Could similar affinities or differences in journal similarity be revealed using citing discipline analysis with journals from multiple fields? Also, by looking at multiple fields, does this pull journals also classified into other disciplines further out of an assigned category when included?

The present research is guided by the following questions:
1. To what extent are high impact journals from allied disciplines similar to one another based on the discipline of the articles that cite a given journal?
2. Do journals classified into multiple disciplines more closely align with one discipline than another or serve as bridges between the disciplines to which they are mapped based on the citing discipline distribution?

The field of Information Science and Library Science was selected as the seed discipline based on its interdisciplinary nature and familiarity to the authors. The top 20 journals based on 2012 JCR impact factors were selected. Four additional allied WoS disciplines were also selected based on the co-classification of journals appearing in the top 20 ISLS list with other disciplines, and the affiliation of information science and library science academic units with other disciplinary units, which demonstrates another type of alliance.

The four JCR disciplines selected comprised:
◦ Communication (COMM) – based on the existence of schools of communication & information.
◦ Computer Science, Information Systems (CSIS) – based on a number of iSchools and journal overlap in JCR.
◦ Education & Educational Research (EDER) – based on a number of ISLS units affiliated with colleges/schools of education.
◦ Management (MGMT) – based on the overlap of journals, particularly in Management Information Systems (MIS).

A sixth, more intellectually distant, discipline, namely Geology (GEOL), was also included. Geology was selected based on the outcomes of the UCSD Map of Science (Börner et al., 2012), where Earth Sciences were mapped as distant from the Social Sciences. By including journals from a more distant discipline, the ability for the citing discipline method to distinguish between more closely aligned and distant disciplines could be tested, where the distinctiveness of allied disciplines may be less defined by including a more distant discipline in the analysis.

A total of 120 journals were studied over the time period 1987–2012. To allow for a comparison over time, the journals were subdivided into three time periods: 1987–1995, 1996–2004, and 2005–2012. 

The data collection method for determining the frequency distribution of citing disciplines used in Wang and Wolfram (forthcoming) was adopted for the present study.

The “Create Citation Report” option in WoS was selected to identify all citing articles. The number of Citing Articles was then selected to retrieve the list of citing articles. The WoS “Analyze Results” feature was next selected for the list of citing articles. On the Results Analysis page, “Research Areas” were selected as the ranking field to provide the tabulated list of citing disciplines.

Salton’s Cosine measure was used to determine the similarity between pairs of journals, resulting in a symmetric similarity matrix (Ahlgren, Jarneving, & Rousseau, 2003; Egghe & Leydesdorff, 2009; Leydesdorff, 2006).
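As a rough illustration of this step, the cosine between two citing discipline signatures is their dot product divided by the product of their vector lengths. The sketch below uses hypothetical signatures and placeholder journal names; `cosine` and `similarity_matrix` are names chosen here for illustration.

```python
import math


def cosine(sig_a, sig_b):
    """Salton's cosine between two sparse frequency vectors stored as dicts."""
    areas = set(sig_a) | set(sig_b)
    dot = sum(sig_a.get(k, 0) * sig_b.get(k, 0) for k in areas)
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(signatures):
    """Return journal names and the symmetric matrix of pairwise cosines."""
    names = sorted(signatures)
    matrix = [[cosine(signatures[a], signatures[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical signatures: J1 and J2 are cited in the same proportions,
# J3 is cited by an unrelated discipline.
signatures = {
    "J1": {"ISLS": 40, "CS": 10},
    "J2": {"ISLS": 20, "CS": 5},
    "J3": {"Geology": 50},
}
names, matrix = similarity_matrix(signatures)
for name, row in zip(names, matrix):
    print(name, [round(v, 3) for v in row])
```

Because the cosine depends only on the shape of the distribution, a heavily cited journal and a lightly cited one with the same disciplinary profile come out as maximally similar.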

Multidimensional scaling (MDS) analysis and hierarchical cluster analysis using SPSS v.20 were employed to visualize and categorize the relationships among the journals for each time period.
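The clustering half of this step can be sketched without SPSS. The example below is a plain-Python agglomerative clustering that uses 1 minus the cosine similarity as the distance and average linkage as the merge rule; the linkage choice is an assumption, since the text does not say which method was used in SPSS, and the similarity values are hypothetical. It does not cover the MDS map.

```python
def average_linkage(names, sim, k):
    """Merge the closest clusters until k remain; distance = 1 - similarity."""
    clusters = [[i] for i in range(len(names))]

    def dist(c1, c2):
        # Average pairwise distance between members of the two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(1.0 - sim[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]


# Hypothetical similarity matrix for four placeholder journals.
journal_names = ["A", "B", "C", "D"]
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
groups = average_linkage(journal_names, sim, k=2)
print(groups)
```

Cutting the merge sequence at a chosen number of clusters corresponds to reading the dendrogram at the three-, five-, or six-cluster level discussed in the results.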

To provide a complementary analysis of the hidden groups that may be present in the data, SPSS’s Factor Analysis using Principal Component extraction with varimax rotation was also applied to the data using routines.


Fig. 1 shows the MDS locus of 70 selected journals in the first time period (1987–1995). The raw stress value was 0.01294, and the stress-I was 0.11376.
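Stress values summarize how well the distances on the map reproduce the input dissimilarities, with lower values indicating a better fit. As a hedged illustration, the function below computes Kruskal's Stress-1 for a given configuration, using the raw dissimilarities as the target distances; SPSS may apply a different normalization or an optimal rescaling, so this sketch is not expected to reproduce the reported figures. The coordinates and dissimilarities are hypothetical.

```python
import math


def stress_1(dissimilarities, coords):
    """Kruskal's Stress-1 of a configuration, with no optimal rescaling."""
    n = len(coords)
    num = 0.0
    den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])  # distance on the map
            num += (dissimilarities[i][j] - d) ** 2
            den += d ** 2
    return math.sqrt(num / den) if den else 0.0


# Three hypothetical points on a line.
coords = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
exact = [[0, 1, 3], [1, 0, 2], [3, 2, 0]]        # map matches the input exactly
perturbed = [[0, 1, 2.5], [1, 0, 2], [2.5, 2, 0]]  # one pair is misrepresented

perfect_fit = stress_1(exact, coords)
imperfect_fit = stress_1(perturbed, coords)
print(perfect_fit, round(imperfect_fit, 4))
```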

Only five clusters are shown because a distinctive sixth cluster did not emerge for this time period.

At the five-cluster level of assignment, journals from COMM, EDER, and MGMT form coherent clusters, although based on the MDS map some journals in each field are more closely located to journals in an allied discipline.

The fourth cluster combines journals from ISLS and CSIS. It is possible that the relatively small number of purely CSIS journals for this time period did not provide enough data for these journals to cluster into separate groups.

The MIS journals are situated in the ISLS cluster, but are located between the Library and Information Science (LIS) journals and MGMT journals.

The fifth cluster on the right side of the map consists of GEOL journals and, as would be expected, is quite distinctive from the other five disciplines.

The location of some journals on the map suggests that they are more similar to journals in one of the other given categories in JCR. The Journal of Advertising Research (JAR), for example, is situated with management journals but is only classified with COMM (and with Business, but this discipline is not included in this study).

Some journals classified in two disciplines served as a bridge between the two disciplines in the map. For instance, Science Communication (SCOMM), although classified with COMM journals, is situated in the ISLS cluster, but is in closer proximity to the COMM journals. In the remaining two time periods, SCOMM clusters with the COMM journals. Social Science Computer Review (SSCR), which is classified with ISLS and clusters with the discipline, is situated between ISLS, COMM and/or MGMT journals in each time period. The same is observed with Telecommunications Policy (TP), which bridges ISLS and MGMT for each of the periods of study.

The outcome for the second time period (1996–2004) appears in Fig. 2. The raw stress and stress-I values are 0.01648 and 0.12798, respectively.

In this map, 93 journals were categorized into six clusters, with CSIS separating from the ISLS cluster during this time period.

Of note is the greater number of journals assigned to one or more disciplines but aligning more closely to another discipline or only one of the assigned disciplines.

As an example, the Academy of Management Learning & Education (AMLE) and JAR are situated in the management cluster, and are located relatively far from their assigned disciplines, EDER and COMM, respectively.

Decision Support Systems (DSS) clusters with the ISLS journals but is classified in CSIS only. The same classification is observed for this journal in the third time period.

The Journal of the American Medical Informatics Association (JAMIA) is classified in ISLS and CSIS, but clusters with the CSIS journals for the remaining time periods.

Similarly, the Journal of Health Communication (JHC), which is classified in COMM and ISLS, does not appear to be similar to other ISLS journals and is situated more closely to COMM journals and clusters with them.

The International Journal of Geographical Information Science (IJGIS), which is classified with ISLS journals, clusters with CSIS journals for this time period and the third period, although it is at the periphery of the cluster, perhaps indicating that CSIS is the best match among the disciplines studied but not a very close one. Proximally, it is situated between CSIS and GEOL journals, which may indicate at least a peripheral similarity to some GEOL journals.

Again, the GEOL journals all cluster together farther from the other disciplinary groups.

Results for the third time period (2005–2012) appear in Fig. 3. The six clusters roughly correspond to the six disciplines. The results of raw stress calculation and the stress-I calculation are still relatively low, at 0.02224 and 0.14914, respectively.

Additional journals classified in COMM map closely to and cluster more closely with MGMT journals.

The International Journal of Computer-supported Collaborative Learning (IJCSCL) is classified with both EDER and ISLS but is clearly situated and clusters with the EDER journals.

Business Strategy and the Environment (BSE), although clustered with MGMT journals, appears to be pulled toward the GEOL journals, indicating a possible relationship with some of these journals.

Once again, the GEOL journals are distinctly clustered away from the remaining journals.

With each time period, more journals cross disciplinary boundaries by clustering with journals from allied disciplines or by mapping more closely to journals in allied disciplines. There may be several influencing factors to account for this observation.

First, the journals indeed may be becoming more interdisciplinary in their publication coverage, thereby attracting more citations from publications in other disciplines.

Second, the journals themselves may not be more interdisciplinary in their coverage, but are now more easily discovered by authors in other disciplines given the wider availability of federated search tools.

Third, with a greater number of journals included in the analysis for each time period, the distinctiveness of the citing discipline signatures may be decreasing, so some journals classified in allied disciplines may appear more similar to one another.

To determine common dimensions from the dataset, a Principal Component Analysis was conducted in SPSS for each time period. Outcomes for the Kaiser–Meyer–Olkin measure of sample adequacy (above 0.7) and Bartlett’s Test of Sphericity (p < .05) indicate the data were appropriate for PCA for all three time periods.


In this period, six components, which correspond to the six disciplines, explain 88.1% of the total variance.

There are five journals underlined in Table 2 that belong to two components, introducing inter-factorial complexity (Van den Besselaar & Heimeriks, 2001; Leydesdorff, 2007), including three MGMT journals (ISJ, ISR, MISQ).
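Journals that belong to two components can be picked out of a rotated loading table by checking which journals have a sizeable loading on more than one component. The sketch below illustrates the idea only: the loadings and journal names are hypothetical, and the 0.4 cutoff is an assumption, since the text does not state the threshold used in the study.

```python
def cross_loading(loadings, cutoff=0.4):
    """Return journals whose absolute loading reaches the cutoff on 2+ components."""
    flagged = {}
    for journal, row in loadings.items():
        strong = [comp for comp, value in row.items() if abs(value) >= cutoff]
        if len(strong) > 1:
            flagged[journal] = strong
    return flagged


# Hypothetical rotated loadings for two placeholder journals.
loadings = {
    "Journal X": {"ISLS": 0.72, "MGMT": 0.55, "COMM": 0.05},
    "Journal Y": {"ISLS": 0.88, "MGMT": 0.12, "COMM": 0.03},
}
flagged = cross_loading(loadings)
print(flagged)
```

A journal that clears the cutoff on no component at all would be the opposite case, one that is peripheral to every discipline in the analysis.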

SCOMM and DSS are assigned only to COMM and CSIS by WoS, respectively, but they also appear in the ISLS component, which supports the MDS and clustering outcomes. DSS continues to also load with ISLS for the remaining time periods.

Similarly, JAR and AMLE, journals classified by WoS only in COMM and EDER, respectively, load with the MGMT component for all the time periods in which they appear, but not their classified discipline, lending support for the re-classification of these journals.

The ISLS journals ARIST, EJIS, IM, JASIST, JIS, JIT, JSIS, and MISQ are also classified in CSIS, but do not load with the CSIS component.

Outcomes for 1996–2004 appear in Table 3.

As with the first time period, several MIS journals (IM, ISJ, ISR, JIT, JMIS, JSIS, MISQ) load to both MGMT and ISLS.

SSCR, which is classified with ISLS only, loads into the ISLS component, but also loads with a higher value into the COMM component, perhaps indicating the need for an additional classification assignment.

IJGIS does not load into any of the six components for the second or third time period, lending evidence to the peripheral nature of the journal to the six fields studied and indicating it might be misclassified in ISLS.

Component outcomes for 2005–2012 appear in Table 4.

Similar to the previous time period, a growing number of MIS journals load to the ISLS and MGMT components.

As with IJGIS, two other journals, Journal of Chemical Information and Modeling (JCIM) and Journal of Cheminformatics (JCHEM) do not load into any of the six components, indicating a poor association with the six disciplines.

The boundaries between ISLS and CSIS are not as clear in the MDS and cluster analysis outcomes, where combinations of computer science, library and information science and management information systems journals may cluster together depending on the time period. These results may be influenced by the fact that a number of journals in the ISLS area are also categorized in the CSIS or MGMT category, thereby strengthening their relationships.

Despite the influence of the assigned discipline(s) – which then strengthens the disciplinary relationship(s) through journal self-citations, whether or not it is the best fit – some journals appear to be misclassified based on the disciplinary designations of the citations they attract.

As a prime example, JAR, which is only assigned to the COMM field, is situated in the MGMT category for each of the time periods studied for the MDS and clustering analysis as well as with the Principal Component Analysis.

A similar outcome is observed for AMLE for the two time periods in which it is included. It is classified as an EDER journal, but is situated with the MGMT journals based on the analyses conducted.

Several other journals are classified in more than one discipline, but clearly associate only with journals in one of the disciplines. JHC is classified in COMM and ISLS but clusters only with COMM journals. The same is observed for IJCSCL, which is classified in ISLS and EDER but groups only with EDER journals for each of the grouping methods used.

Other journals appear to move between disciplines over time. SCOMM is classified in COMM, but the MDS and clustering outcome places it initially with the ISLS journals for the first time period, and then in the COMM cluster and further away from ISLS in the last two time periods.

Some journals, such as JCMC and SSCR, appear to be situated near borders between disciplines, which points to their interdisciplinary appeal or may indicate they serve as bridges between the disciplines.

A number of the ISLS journals that are considered to be library and information science journals (Nisonger & Davis, 2005) appear between CSIS journals and those in the MGMT cluster. In particular, MIS journals appear between MGMT and the ISLS/CSIS cluster for the first time period.

One application of citing discipline analysis that emerges from this analysis is that of decision support for the additional assignment or reassignment of journals to one or more disciplines.

Journal disciplinary classifications should be revisited over time to accommodate shifts in how journals are being cited by other disciplines. ... However, the shifts are at least an indication that the subject affiliations of the citing journals are changing.

Citing discipline analysis provides, with some modest programming, a relatively easily implemented method for assessing the similarity of journals within disciplines or across allied disciplines that is computationally less expensive than using citing journal-based data.

The analyses reveal distinct groupings of journals based on their disciplinary assignments. As observed earlier in Wang and Wolfram (forthcoming), who only examined journals in a single field, the current research has demonstrated that citing discipline analysis can provide coherent and meaningful disciplinary groupings for journals in allied fields, even when journals from a more intellectually distant field are included.

The clustering and proximity of some journals classified in allied fields has changed over time, perhaps indicating a changing citing relationship between these fields.