Monday, March 30, 2015

Chen, C., Ibekwe-SanJuan, F. and Hou, J. (2010), The structure and dynamics of cocitation clusters: A multiple-perspective cocitation analysis. Journal of the American Society for Information Science and Technology, 61 (7), 1386–1409. doi: 10.1002/asi.21309

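Identifying the nature of specialties in a scientific field is a fundamental challenge for information science (Morris & Van der Veer Martens, 2008; Tabah, 1999). Research on this problem keeps growing for three reasons: 1) bibliographic data sources are increasingly accessible; 2) more and more software tools for analysis and visualization are available on the Web; and 3) the demand for digesting large volumes of data from multiple sources is intensifying. Cocitation analysis is among the most commonly used methods in quantitative studies of science, especially author cocitation analysis (ACA; Chen, 1999; Leydesdorff, 2005; White & McCain, 1998; Zhao & Strotmann, 2008b) and document cocitation analysis (DCA; Chen, 2004; Chen, 2006; Chen, Song, Yuan, & Zhang, 2008; Small & Greenlee, 1986; Small & Sweeney, 1985; Small, Sweeney, & Greenlee, 1985). ACA aims to identify the specialties in a field through groups of authors who are cited together in the relevant literature. An important ACA study is White & McCain (1998), which analyzed the 120 most-cited authors in 12 information science journals from 1972 to 1995 and found that information science at the time consisted of two essentially independent camps: information retrieval and literature. Zhao and Strotmann (2008a, 2008b) repeated the study with information science journal data for 1996-2005 and found five major specialties: user studies, citation analysis, experimental retrieval, Webometrics, and visualization of knowledge domains. The two emerging specialties, Webometrics and visualization of knowledge domains, connected citation analysis and experimental retrieval, while user studies had become the largest specialty. Aström (2007) is an example of DCA: it analyzed 21 library and information science journals from 1990 to 2004 and presented the results with multidimensional scaling (MDS). Like White & McCain (1998), the study found that the field divides into two camps, but Aström (2007) called one of them information seeking and retrieval rather than information retrieval.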

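Whether for ACA or DCA, the procedure generally runs as follows:
1) Retrieve citation data.
2) Construct a matrix of cocited references or authors.
3) Represent the cocitation matrix as a node-and-link graph or as a multidimensional scaling (MDS) configuration, optionally pruning links with Pathfinder network scaling or a minimum spanning tree.
4) Identify specialties with algorithms for clustering, community finding, factor analysis, principal component analysis, or latent semantic indexing. Examples include Morris & Van der Veer Martens (2008), Persson (1994), Tabah (1999), White & Griffith (1982), and Janssens, Leta, Glänzel, and De Moor (2006).
5) Interpret the nature of the cocitation clusters according to the common themes among cluster members. This usually requires rich domain knowledge and is time-consuming and cognitively demanding.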
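
Step 2 can be illustrated with a minimal Python sketch. It assumes each citing article is available as a list of cited-reference identifiers; the function name `cocitation_counts` and the sample identifiers are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how often each pair of cited items shares a reference list."""
    pair_counts = Counter()
    for refs in reference_lists:
        # Deduplicate within one citing article, then sort so that each
        # pair is always stored in the same (a, b) order.
        for a, b in combinations(sorted(set(refs)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


# Hypothetical citing articles, each given as its list of cited references.
citing_articles = [
    ["Salton1975", "Small1985", "White1998"],
    ["Small1985", "White1998"],
    ["Salton1975", "White1998"],
]
counts = cocitation_counts(citing_articles)
```

The sparse pair counts play the role of the cocitation matrix; for ACA the same function could be fed lists of cited authors instead of cited documents.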

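This study characterizes and interprets the structure and dynamics of the clusters formed by author cocitation and document cocitation. The data cover 12 information science journals from 1996 to 2008: 10,853 bibliographic records that cite 129,060 references a total of 206,180 times, with 58,711 cited authors. The study measures the strength of association between authors or documents with the cosine coefficient, uses these values as links between nodes to build a network, and then finds clusters by computing eigenvectors of Laplacian matrices derived from the original network. This spectral clustering algorithm relies on standard linear algebra, is often more efficient than other clustering algorithms, and is more flexible and robust because it makes no assumptions about the form of the clusters. Clusters are labeled using terms and summary sentences from the citing articles. The terms are noun phrases and index terms found in titles and abstracts, ranked by three measures: tf*idf (Salton, Yang, & Wong, 1975), log-likelihood ratio (LLR) tests (Dunning, 1993), and mutual information (MI). The summary sentences are the most representative sentences found in titles and abstracts, for example as ranked by Enertex (Fernandez, SanJuan, & Torres-Moreno, 2007).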




A multiple-perspective cocitation analysis method is introduced for characterizing and interpreting the structure and dynamics of cocitation clusters.

The generic method is applied to a three-part analysis of the field of information science as defined by 12 journals published between 1996 and 2008: (a) a comparative author cocitation analysis (ACA), (b) a progressive ACA of a time series of cocitation networks, and (c) a progressive document cocitation analysis (DCA).

Identifying the nature of specialties in a scientific field is a fundamental challenge for information science (Morris & Van der Veer Martens, 2008; Tabah, 1999).

The growing interest in mapping and visualizing the structure and dynamics of specialties stems from several reasons:
1. Widely accessible bibliographic data sources such as the Web of Science, Scopus, and Google Scholar (Bar-Ilan, 2008; Meho & Yang, 2007) as well as domain-specific repositories such as ADS (http://www.adsabs.harvard.edu/) and arXiv (http://arxiv.org/).
2. Freely available computer programs and Web-based general-purpose visualization and analysis tools such as ManyEyes (http://manyeyes.alphaworks.ibm.com/) and Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/; Batagelj & Mrvar, 1998), special-purpose citation analysis tools such as CiteSpace (http://cluster.cis.drexel.edu/~cchen/citespace/; Chen, 2004; Chen, 2006), and social network analysis tools such as UCINET (http://www.analytictech.com/ucinet6/ucinet.htm).
3. Intensified challenges for digesting the vast volume of data from multiple sources (e.g., e-Science, Digging into Data (http://www.diggingintodata.org/), cyber-enabled discovery, SciSIP; Lane, 2009).

Cocitation studies are among the most commonly used methods in quantitative studies of science, especially including author cocitation analysis (ACA; Chen, 1999; Leydesdorff, 2005; White & McCain, 1998; Zhao & Strotmann, 2008b) and document cocitation analysis (DCA; Chen, 2004; Chen, 2006; Chen, Song, Yuan, & Zhang, 2008; Small & Greenlee, 1986; Small & Sweeney, 1985; Small, Sweeney, & Greenlee, 1985).

For instance, once cocitation clusters are identified, assigning the most meaningful labels for these clusters is currently a challenging task because any representative labels of clusters must characterize not only what clusters appear to represent, but also the salient and unique reasons for their formation.

The new procedure reduces analysts' cognitive burden by automatically characterizing the nature of a cocitation cluster in terms of (a) salient noun phrases extracted from titles, abstracts, and index terms of citing articles and (b) representative sentences as summarizations of clusters.

ACA aims to identify underlying specialties in a field in terms of groups of authors who were cited together in relevant literature.

White & McCain (1998) presented a comprehensive view of information science based on 12 journals in library and information science across a 24-year span (1972–1995). It analyzed cocitation patterns of 120 most-cited authors with factor analysis and multidimensional scaling. The authors drew upon their extensive knowledge of the field and offered an insightful interpretation of 12 specialties identified in terms of 12 factors. The most well-known finding of the study is that information science at the time consisted of two essentially independent camps, namely, the information retrieval camp and the literature camp, including citation analysis, bibliometrics, and scientometrics.

Zhao and Strotmann (2008a, 2008b) followed up White and McCain's study using the same set of 12 journals and the same number of 120 cited authors in an updated time frame of 1996-2005. ... Zhao and Strotmann (2008b) found five major specialties and manually labeled them as user studies, citation analysis, experimental retrieval, Webometrics, and visualization of knowledge domains. In contrast to the findings of White & McCain (1998), experimental retrieval and citation analysis retained their fundamental roles in the field, and the user studies specialty became the largest specialty. Webometrics and visualization of knowledge domains appeared to make connections between the retrieval camp and the citation analysis camp.

A DCA by Aström (2007) studied papers published between 1990 and 2004 in 21 library and information science journals. Results were depicted in multidimensional scaling (MDS) maps. Aström's study also identified the two-camp structure found by White & McCain (1998). On the other hand, Aström found an information seeking and retrieval camp instead of the information retrieval camp reported by White & McCain (1998).

Although manually labeling a cocitation cluster can be a very rewarding process of learning about the underlying specialty and result in insightful and easy to understand labels, it requires a substantial level of domain knowledge and it tends to be time-consuming and cognitively demanding because of the synthetic work required over a diverse range of individual publications.

Traditionally, researchers often identify the nature of a cocitation cluster based on common themes among its members. ... The emphasis on common areas is a practical strategy; otherwise, comprehensively identifying the nature of a specialty can be too complex to handle manually.

Many researchers have studied the structural and dynamic properties of specialties in information science in terms of clusters, multivariate factors, and principal components (Morris & Van der Veer Martens, 2008; Persson, 1994; Tabah, 1999; White & Griffith, 1982).

A recent study of information science (Ibekwe-SanJuan, 2009) mapped the structure of information science at the term level using a text analysis system TermWatch and a network visualization system Pajek, but it did not address structural patterns of cited references.

Researchers also studied the structure of information science qualitatively, especially with direct inputs from domain experts. For example, Zins conducted a Critical Delphi study of information science, involving 57 leading information scientists from 16 countries (Zins, 2007a, 2007b, 2007c, 2007d).

Janssens, Leta, Glänzel, and De Moor (2006) studied the full-text of 938 publications in five library and information science journals with latent semantic analysis (LSA; Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990) and agglomerative clustering. They found an optimal 6-cluster solution in terms of a local maximum of the mean silhouette coefficients (Rousseeuw, 1987) and a stability diagram (Ben-Hur, Elisseeff, & Guyon, 2002). Their clusters were labeled with single-word terms selected by tf*idf (p. 1625), which are not as informative as multiword terms for cluster labels.

Klavans, Persson, and Boyack (2009) recently raised the question of the true number of specialties in information science. They suspected that the number is much more than the 11 or 12 reported in ACA studies such as White & McCain (1998) and Zhao & Strotmann (2008a, 2008b), but significantly fewer than the 72 reported in their own study, which is also based on the 12 journals between 2001 and 2005.

The 12-journal Information Science dataset, retrieved from the Web of Science, contains 10,853 unique bibliographic records, written by 8,408 unique authors from 6,553 institutions and 89 countries. These articles cited 129,060 unique references for a total of 206,180 times. They cited 58,711 unique authors and 58,796 unique sources.

The traditional procedure of cocitation analysis for both DCA and ACA comprises the following steps:
1. Retrieve citation data from sources such as the Science Citation Index (SCI), Social Science Citation Index (SSCI), Scopus, and Google Scholar.
2. Construct a matrix of cocited references (DCA) or authors (ACA).
3. Represent the cocitation matrix as a node-and-link graph or as a multidimensional scaling (MDS) configuration with possible link pruning using Pathfinder network scaling or minimum spanning tree algorithms.
4. Identify specialties in terms of cocitation clusters, multivariate factors, principal components, or dimensions of a latent semantic space using a variety of algorithms for clustering, community finding, factor analysis, principal component analysis, or latent semantic indexing.
5. Interpret the nature of cocitation clusters.

The interpretation step is the weakest link. It is time-consuming and cognitively demanding, requiring a substantial level of domain knowledge and synthesizing skills. In addition, much attention routinely focuses on cocitation clusters per se, but the role of the citing articles that are responsible for the formation of such cocitation clusters may not always be investigated as an integral part of a specialty.

Our new method extends and enhances traditional cocitation methods in two ways: (a) by integrating structural and content analysis components sequentially into the new procedure and (b) by facilitating analytic tasks and interpretation with automatic cluster labeling and summarization functions. The new procedure is highlighted in yellow in Figure 2, including clustering, automatic labeling, summarization, and latent semantic models of the citing space (Deerwester et al., 1990).

Our new procedure adopts several structural and temporal metrics of cocitation networks and subsequently generated clusters.

Structural metrics include betweenness centrality, modularity, and silhouette.

Temporal and hybrid metrics include citation burstness and novelty.

The betweenness centrality metric is defined for each node in a network. It measures the extent to which the node is in the middle of a path that connects other nodes in the network (Brandes, 2001; Freeman, 1977). High betweenness centrality values identify potentially revolutionary scientific publications (Chen, 2005) as well as gatekeepers in social networks.
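
As an illustration of the metric, here is a minimal pure-Python sketch of the Brandes (2001) approach for an unweighted, undirected graph. The graph, the node names, and the function name `betweenness_centrality` are hypothetical, and the scores are left unnormalized; this is not necessarily how CiteSpace computes or scales the values.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Unnormalized betweenness for an undirected graph {node: set(neighbors)}."""
    scores = {v: 0.0 for v in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from the source.
        order = []
        predecessors = {v: [] for v in adjacency}
        path_count = {v: 0 for v in adjacency}
        distance = {v: -1 for v in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    path_count[w] += path_count[v]
                    predecessors[w].append(v)
        # Accumulate pair dependencies in reverse breadth-first order.
        dependency = {v: 0.0 for v in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += path_count[v] / path_count[w] * (1.0 + dependency[w])
            if w != source:
                scores[w] += dependency[w]
    # Every undirected pair was visited from both of its endpoints.
    return {v: s / 2.0 for v, s in scores.items()}


# Two tightly linked pairs that are joined only through "hub".
graph = {
    "a1": {"a2", "hub"},
    "a2": {"a1", "hub"},
    "hub": {"a1", "a2", "b1", "b2"},
    "b1": {"b2", "hub"},
    "b2": {"b1", "hub"},
}
centrality = betweenness_centrality(graph)
```

In this toy graph only "hub" sits on shortest paths between the two groups, so it is the only node with a nonzero score, which mirrors the gatekeeper reading of the metric.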

In the context of this study, the modularity Q measures the extent to which a network can be divided into independent blocks, i.e., modules (Newman, 2006; Shibata, Kajikawa, Takeda, & Matsushima, 2008).
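
For a concrete sense of Q, the sketch below computes Newman's modularity for an unweighted, undirected network and a given partition, using the per-cluster form Q = sum over clusters of (internal links / m) - (degree sum / 2m)^2, where m is the number of links. The edge list, the cluster names, and the function name `modularity` are hypothetical.

```python
def modularity(edges, cluster_of):
    """Newman's Q for an unweighted, undirected edge list and a hard partition."""
    m = len(edges)
    internal = {}    # links with both endpoints in the same cluster
    degree_sum = {}  # total degree of the nodes in each cluster
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2.0 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Two triangles connected by a single bridging link (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
two_clusters = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_cluster = {node: "A" for node in range(6)}

q_split = modularity(edges, two_clusters)
q_single = modularity(edges, one_cluster)
```

Splitting the network along the bridge gives Q = 5/14, whereas lumping everything into one cluster gives Q = 0, so a higher Q indicates a network that decomposes more cleanly into modules.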

The silhouette metric (Rousseeuw, 1987) is useful in estimating the uncertainty involved in identifying the nature of a cluster.
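
The silhouette value of an item compares its mean distance to its own cluster (a) with its mean distance to the nearest other cluster (b), as (b - a) / max(a, b). The following sketch assumes a precomputed distance matrix and at least two clusters; the function name and the one-dimensional sample points are hypothetical, and how the paper converts cocitation similarities into distances is not specified here.

```python
def silhouette_values(distance, labels):
    """Silhouette value per item, given a square distance matrix and hard labels.

    Assumes at least two distinct cluster labels.
    """
    n = len(labels)
    clusters = set(labels)
    values = []
    for i in range(n):
        own = [distance[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            # A singleton cluster gets 0 by Rousseeuw's convention.
            values.append(0.0)
            continue
        a = sum(own) / len(own)
        # Mean distance to the closest other cluster.
        b = min(
            sum(distance[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        values.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return values


# Hypothetical items placed on a line: two tight, well-separated groups.
points = [0.0, 1.0, 10.0, 11.0]
distance = [[abs(p - q) for q in points] for p in points]
labels = [0, 0, 1, 1]
values = silhouette_values(distance, labels)
mean_silhouette = sum(values) / len(values)
```

Values near 1 indicate items that clearly belong to their cluster; values near 0 or below indicate items whose cluster assignment is uncertain.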

Burst detection determines whether a given frequency function has statistically significant fluctuations during a short time interval within the overall time period.

Sigma is introduced in Chen et al. (2009a) as a measure of scientific novelty. ... In this study, Sigma is defined as (centrality + 1)^burstness such that the brokerage mechanism plays a more prominent role than the rate of recognition by peers.
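
A small worked example, reading the definition as (centrality + 1) raised to the power of burstness. The function name and the input values are hypothetical.

```python
def sigma(centrality, burstness):
    """Sigma = (centrality + 1) ** burstness."""
    return (centrality + 1.0) ** burstness


no_brokerage = sigma(0.0, 10.0)   # zero centrality: Sigma stays at 1
no_burst = sigma(0.5, 0.0)        # zero burstness: Sigma stays at 1
both = sigma(0.5, 2.0)            # 1.5 ** 2
```

Under this reading a strong citation burst alone cannot raise Sigma above 1; the node must also have nonzero betweenness centrality.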

We adopt a hard clustering approach such that a cocitation network is partitioned into a number of nonoverlapping clusters.

In this article, cocitation similarities between items i and j are measured in terms of cosine coefficients.
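
One common form of the cosine coefficient divides the number of articles citing both items by the geometric mean of the two citation counts. The sketch below assumes that form; the function name and the citing-article identifiers are hypothetical.

```python
import math


def cosine_coefficient(citers_i, citers_j):
    """Cosine similarity between the sets of articles that cite items i and j."""
    a, b = set(citers_i), set(citers_j)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Item i is cited by p1..p4, item j by p3..p6; they share two citing articles.
similarity = cosine_coefficient({"p1", "p2", "p3", "p4"}, {"p3", "p4", "p5", "p6"})
```

Here the two items are cocited twice and each is cited four times, so the coefficient is 2 / sqrt(4 * 4) = 0.5.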

A good partition of a network would group strongly connected nodes together and assign loosely connected ones to different clusters. This idea can be formulated as an optimization problem in terms of a cut function defined over a partition of a network. Technical details are given in relevant literature (Luxburg, 2006; Ng, Jordan, & Weiss, 2002; Shi & Malik, 2000).

Spectral clustering is an efficient and generic clustering method (Luxburg, 2006; Ng et al., 2002; Shi & Malik, 2000). It has roots in spectral graph theory. Spectral clustering algorithms identify clusters based on eigenvectors of Laplacian matrices derived from the original network.
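
To make the idea concrete, the sketch below performs the simplest spectral split: it cuts a connected network in two by the sign of the Fiedler vector, the eigenvector for the second-smallest eigenvalue of the unnormalized Laplacian L = D - W. This is a deliberately simplified, pure-Python stand-in found by power iteration, not the normalized multi-cluster algorithms of Shi & Malik (2000) or Ng et al. (2002), and not necessarily what CiteSpace implements. The function name and the sample weights are hypothetical.

```python
def fiedler_bisection(weights, iterations=500):
    """Two-way split of a connected graph by the sign of the Fiedler vector.

    weights: symmetric list-of-lists of non-negative link weights, zero diagonal.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    # Unnormalized Laplacian L = D - W.
    laplacian = [
        [(degree[i] if i == j else 0.0) - weights[i][j] for j in range(n)]
        for i in range(n)
    ]
    # Power iteration on (shift * I - L). The shift is a Gershgorin upper
    # bound on the eigenvalues of L, so small eigenvalues of L become large.
    shift = 2.0 * max(degree)
    x = [float(i) for i in range(n)]
    for _ in range(iterations):
        # Project out the constant vector (eigenvalue 0 of L).
        mean = sum(x) / n
        x = [xi - mean for xi in x]
        y = [
            shift * x[i] - sum(laplacian[i][j] * x[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return [0 if xi < 0 else 1 for xi in x]


# Two triangles (0, 1, 2) and (3, 4, 5) joined by one weak link between 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
weights = [[0.0] * 6 for _ in range(6)]
for u, v in edges:
    weights[u][v] = weights[v][u] = 1.0
labels = fiedler_bisection(weights)
```

The split falls on the bridging link, separating the two triangles. Which side is labeled 0 depends on the sign of the converged vector, so only the grouping is meaningful.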

Spectral clustering has several desirable features compared to traditional algorithms such as k-means and single linkage (Luxburg, 2006):
• It is more flexible and robust because it does not make any assumptions on the forms of the clusters,
• it makes use of standard linear algebra methods to solve clustering problems, and
• it is often more efficient than traditional clustering algorithms.

Candidate cluster labels are selected from noun phrases and index terms of citing articles of each cluster. These terms are ranked by three different algorithms. In particular, noun phrases are extracted from titles and abstracts of citing articles. The three term ranking algorithms are tf*idf (Salton, Yang, & Wong, 1975), log-likelihood ratio (LLR) tests (Dunning, 1993), and mutual information (MI).
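
As an illustration of one of the three rankings, the sketch below scores candidate terms with Dunning's log-likelihood ratio on a 2x2 table: citing articles of the cluster versus citing articles of other clusters, with versus without the term. Setting up the table this way is an assumption, as are the function names and the sample term sets.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    g2 = 0.0
    for observed, row, col in (
        (k11, k11 + k12, k11 + k21),
        (k12, k11 + k12, k12 + k22),
        (k21, k21 + k22, k11 + k21),
        (k22, k21 + k22, k12 + k22),
    ):
        if observed > 0:  # a zero cell contributes nothing to the sum
            expected = row * col / total
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def rank_terms_by_llr(cluster_docs, other_docs):
    """Rank the terms seen in a cluster's citing articles by LLR, highest first."""
    scores = {}
    for term in set().union(*cluster_docs):
        k11 = sum(term in doc for doc in cluster_docs)
        k21 = sum(term in doc for doc in other_docs)
        scores[term] = log_likelihood_ratio(
            k11, len(cluster_docs) - k11, k21, len(other_docs) - k21
        )
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical noun phrases from citing articles, one set per article.
cluster_docs = [
    {"information retrieval", "query expansion"},
    {"information retrieval", "relevance feedback"},
    {"information retrieval", "citation analysis"},
]
other_docs = [
    {"citation analysis", "h-index"},
    {"citation analysis", "bibliometrics"},
    {"bibliometrics", "h-index"},
]
ranked = rank_terms_by_llr(cluster_docs, other_docs)
```

Note that LLR is two-sided: a term that is unusually rare in the cluster also scores high, so a practical ranking might keep only terms that are overrepresented in the cluster.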

Each cocitation cluster is summarized by a list of sentences selected from the abstracts of articles that cite at least one member of the cluster.

In this study, sentences are ranked by Enertex (Fernandez, SanJuan, & Torres-Moreno, 2007). Given a set S of N sentences, let M be the square matrix that for each pair of sentences gives the number of nominal words in common (nouns and adjectives).
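
The matrix M can be sketched as follows, assuming each sentence has already been reduced to its set of nominal words (nouns and adjectives) by some part-of-speech tagger. The function name, the sample word sets, and the choice to leave the diagonal at zero are assumptions; the Enertex energy function itself is not reproduced here.

```python
def shared_word_matrix(sentences):
    """M[i][j] = number of nominal words that sentences i and j have in common."""
    n = len(sentences)
    return [
        # The diagonal is set to 0 here, which is an assumption of this sketch.
        [len(sentences[i] & sentences[j]) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical sentences, each given as its set of nominal words.
sentences = [
    {"cocitation", "cluster", "analysis"},
    {"cluster", "label"},
    {"cocitation", "analysis", "network"},
]
M = shared_word_matrix(sentences)
```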

In this study, summarization sentences were also ranked by two new functions, gtf and gftidf, which are further simplified approximations of the energy function E.

The ACA and DCA studies described in this article were conducted using the CiteSpace system (Chen, 2004; Chen, 2006). CiteSpace is a freely available Java application for visualizing and analyzing emerging trends and changes in scientific literature.

CiteSpace supports a distinctive type of cocitation network analysis, progressive network analysis, which applies a time slicing strategy and then synthesizes a series of individual network snapshots defined on consecutive time slices. Progressive network analysis particularly focuses on nodes that play critical roles in the evolution of a network over time. Such critical nodes are candidates for intellectual turning points.
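
The time slicing step can be sketched as grouping citing records into consecutive slices, after which one cocitation network would be built per slice. This is only an illustration of the idea, not CiteSpace's implementation; the function name, the record layout, and the two-year slice length are assumptions.

```python
from collections import defaultdict


def slice_by_year(records, start, end, slice_length=1):
    """Group (year, reference_list) records into consecutive time slices."""
    slices = defaultdict(list)
    for year, refs in records:
        if start <= year <= end:
            # Key each slice by the first year it covers.
            slice_start = start + (year - start) // slice_length * slice_length
            slices[slice_start].append(refs)
    return dict(sorted(slices.items()))


# Hypothetical citing records: (publication year, cited references).
records = [
    (1996, ["Small1985", "White1998"]),
    (1997, ["Salton1975", "Small1985"]),
    (1998, ["Salton1975", "White1998"]),
    (2008, ["White1998", "Chen2006"]),
    (2010, ["Chen2006"]),  # outside the 1996-2008 window, so it is ignored
]
slices = slice_by_year(records, 1996, 2008, slice_length=2)
```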

In summary, (a) spectral clustering and factor analysis identified about the same number of specialties, but they appeared to reveal different aspects of cocitation structures and (b) cluster labels chosen from citers of a cluster tend to be more specific terms than those chosen by human experts.

We found the comparison with the study of Zhao and Strotmann very valuable. It offered us an opportunity to compare the analysis conducted by human experts to the interpretation cues provided by our automatic labeling and summarization methods.

Spectral clustering for the purpose of network decomposition is exclusive in nature although in reality it is often sensible to allow overlapping clusters because of multiple roles individual entities may play.

Spectral clustering of cocitation networks tends to generate distinct clusters with high precision, whereas human experts tend to aggregate entities into broadly defined clusters.

In conclusion, the new cocitation analysis procedure has the following advantages over the traditional one:
• It can be consistently used for both DCA and ACA.
• It uses more flexible and efficient spectral clustering to identify cocitation clusters.
• It characterizes clusters with candidate labels selected by multiple ranking algorithms from the citers of these clusters and reveals the nature of a cluster in terms of how it has been cited.
• It provides metrics such as modularity and silhouette as quality indicators of clustering to aid interpretation tasks.
• It provides integrated and interactive visualizations for exploratory analysis.

Modularity and silhouette metrics provide useful quality indicators of clustering and network decomposition.
