看見網絡: 3月 2015

2015年3月30日星期一

Chen, C., Ibekwe-SanJuan, F. and Hou, J. (2010), The structure and dynamics of cocitation clusters: A multiple-perspective cocitation analysis. Journal of the American Society for Information Science and Technology, 61 (7), 1386–1409. doi: 10.1002/asi.21309

Chen, C., Ibekwe-SanJuan, F. and Hou, J. (2010), The structure and dynamics of cocitation clusters: A multiple-perspective cocitation analysis. Journal of the American Society for Information Science and Technology, 61 (7), 1386–1409. doi: 10.1002/asi.21309

確認科學領域的專業(specialties)本質是資訊科學的一項基本挑戰 (Morris & Van der Veer Martens, 2008; Tabah, 1999) 。由於1)可取用的書目資料來源愈來愈普及；2)網路上愈來愈多可提供分析與視覺化的電腦軟體工具；3)從多元來源而大量的資料吸收的要求愈來愈劇烈等原因，因此有愈來愈多的相關研究。共被引分析是對科學進行量化分析最常用的方法之一，特別是作者共被引分析 (author cocitation analysis, ACA; Chen, 1999; Leydesdorff, 2005; White & McCain, 1998; Zhao & Strotmann, 2008b)以及文件共被引分析 (document cocitation analysis, DCA; Chen, 2004; Chen, 2006; Chen, Song, Yuan, & Zhang, 2008; Small & Greenlee, 1986; Small & Sweeney, 1985; Small, Sweeney, & Greenlee, 1985)。作者共被引分析的目的在透過被相關文獻一起引用的作者群集，確認領域裡的專業。重要的作者共被引分析研究包括White & McCain (1998)，這個研究以1972到1995年間12種資訊科學相關期刊的120位高被引作者進行作者共被引分析，研究結果發現當時的資訊科學分為兩個基本上彼此獨立的陣營：資訊檢索(information retrieval)與文獻(literature)。Zhao and Strotmann (2008a, 2008b) 以1996-2005年的資訊科學相關期刊資料重新進行了相同的研究，他們的結果發現了5個主要的專業：使用者研究(user studies)、引用分析(citation analysis)、實驗型檢索(experimental retrieval)、網路計量學 (Webometrics)以及知識領域的視覺化(visualization of knowledge domains)，其中新興的兩個專業：網路計量學和知識領域的視覺化連繫了引用分析以及實驗型檢索，而使用者研究則是此時最大的專業。Aström (2007) 則是使用文件共被引分析的例子，他們分析了1990到2004年的21種圖書資訊學期刊，利用多維尺度法(multidimensional scaling, MDS)產生結果，他們的結果與White & McCain (1998)的研究類似，整個領域可分為兩個陣營，不過Aström (2007)的結果將稱為資訊尋求與檢索(information seeking and retrieval)，而不是資訊檢索。

不管是作者共被引分析或是文件共被引分析其步驟大致如下：
1) 檢索引用資料。
2) 建構參考文件或作者共同被引用的矩陣。
3) 將共被引矩陣表示成節點與連結的圖(node-and-link graph)或是多維尺度法的組態(configuration)，並且可以利用尋路網路(Pathfinder network scaling)或最小生成樹(minimum spanning tree)裁減連結。
4) 利用群集、社群發現(community finding)、因素分析(factor analysis)、主成分分析(principle component analysis)或者隱含語意索引(latent semantic indexing)等各種演算法確認專業。例如Morris & Van der Veer Martens (2008)、 Persson (1994)、 Tabah (1999)、 White & Griffith (1982)以及Janssens, Leta, Glänzel, and De Moor (2006)。
5) 根據群集成員間共同的主題(themes)，解釋共被引群集的性質。通常需要豐富的領域知識，而且是一個花費大量時間與認知需求(cognitively demanding)的工作。

本研究對於作者共被引以及文件共被引形成的群集進行結構與動態的描述與解釋，分析的資料為1996到2008年間的12種資訊科學(information science)領域相關期刊，共計10853筆書目紀錄，引用的參考文獻為129060筆，引用次數為206180，而參考文獻的作者共有58711位。本研究以餘弦(cosine)測量作者或文件之間的關連大小，做為節點間的連結，建立網路；然後計算從原先網路導出的Laplacian矩陣(Laplacian matrices)的特徵向量(eigenvectors)找出群集。這種利用標準線性代數的頻譜群集(spectral cluster)演算法，較其他的群集演算法更有效率，而且因為不需要假設群集的形式，所以更有彈性與強健。標註群集方面則是利用引用文獻論文的詞語與摘要句，詞語包括題名與摘要中出現的名詞片語與索引詞(index terms)，利用 tf*idf (Salton, Yang, & Wong, 1975)、對數似然比(log-likelihood ratio, LLR)測試 (Dunning, 1993)以及相互資訊(mutual information, MI)等三種資訊做為判斷的參考。摘要句則是從題名與摘要尋找最有代表性的句子，例如以Enertex (Fernandez, SanJuan, & Torres-Moreno, 2007)對句子進行排序。

A multiple-perspective cocitation analysis method is introduced for characterizing and interpreting the structure and dynamics of cocitation clusters.

The generic method is applied to a three-part analysis of the field of information science as defined by 12 journals published between 1996 and 2008: (a) a comparative author cocitation analysis (ACA), (b) a progressive ACA of a time series of cocitation networks, and (c) a progressive document cocitation analysis (DCA).

Identifying the nature of specialties in a scientific field is a fundamental challenge for information science (Morris & Van der Veer Martens, 2008; Tabah, 1999).

The growing interest in mapping and visualizing the structure and dynamics of specialties is because of a number of reasons:
1. Widely accessible bibliographic data sources such as the Web of Science, Scopus, and Google Scholar (Bar-Ilan, 2008; Meho & Yang,2007) as well as domain-specific repositories such as ADS (http://www.adsabs.harvard.edu/) and arXiv (http://arxiv.org/).
2. Freely available computer programs and Web-based general-purpose visualization and analysis tools such as ManyEyes (http://manyeyes.alphaworks.ibm.com/) and Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/; Batagelj & Mrvar, 1998), special-purpose citation analysis tools such as CiteSpace (http://cluster.cis.drexel.edu/&u0007E;cchen/citespace/; Chen, 2004; Chen, 2006), and social network analysis such as UCINET (http://www.analytictech.com/ucinet6/ucinet.htm).
3. Intensified challenges for digesting the vast volume of data from multiple sources (e.g., e-Science, Digging into Data (http://www.diggingintodata.org/), cyber-enabled discovery, SciSIP; Lane, 2009).

Cocitation studies are among the most commonly used methods in quantitative studies of science, especially including author cocitation analysis (ACA; Chen, 1999; Leydesdorff, 2005; White & McCain, 1998; Zhao & Strotmann, 2008b) and document cocitation analysis (DCA; Chen, 2004; Chen, 2006; Chen, Song, Yuan, & Zhang, 2008; Small & Greenlee, 1986; Small & Sweeney, 1985; Small, Sweeney, & Greenlee, 1985).

For instance, once cocitation clusters are identified, assigning the most meaningful labels for these clusters is currently a challenging task because any representative labels of clusters must characterize not only what clusters appear to represent, but also the salient and unique reasons for their formation.

The new procedure reduces analysts' cognitive burden by automatically characterizing the nature of a cocitation cluster in terms of (a) salient noun phrases extracted from titles, abstracts, and index terms of citing articles and (b) representative sentences as summarizations of clusters.

ACA aims to identify underlying specialties in a field in terms of groups of authors who were cited together in relevant literature.

White & McCain (1998) presented a comprehensive view of information science based on 12 journals in library and information science across a 24-year span (1972–1995). It analyzed cocitation patterns of 120 most-cited authors with factor analysis and multidimensional scaling. The authors drew upon their extensive knowledge of the field and offered an insightful interpretation of 12 specialties identified in terms of 12 factors. The most well-known finding of the study is that information science at the time consisted of two essentially independent camps, namely, the information retrieval camp and the literature camp, including citation analysis, bibliometrics, and scientometrics.

Zhao and Strotmann (2008a, 2008b) followed up White and McCain's study using the same set of 12 journals and the same number of 120 cited authors in an updated time frame of 1996-2005. ... Zhao and Strotmann (2008b) found five major specialties and manually labeled them as user studies, citation analysis, experimental retrieval, Webometrics, and visualization of knowledge domains. In contrast to the findings of (White & McCain, 1998), experimental retrieval and citation analysis retained their fundamental roles in the field, and the user studies specialty became the largest specialty. Webometrics and visualization of knowledge domains appeared to make connections between the retrieval camp and the citation analysis camp.

A DCA by Aström (2007) studied papers published between 1990 and 2004 in 21 library and information science journals. Results were depicted in multidimensional scaling (MDS) maps. Aström's study also identified the two-camp structure found by (White & McCain, 1998). On the other hand, Aström found an information seeking and retrieval camp, instead of the information retrieval camp as in (White and McCain).

Although manually labeling a cocitation cluster can be a very rewarding process of learning about the underlying specialty and result in insightful and easy to understand labels, it requires a substantial level of domain knowledge and it tends to be time-consuming and cognitively demanding because of the synthetic work required over a diverse range of individual publications.

Traditionally, researchers often identify the nature of a cocitation cluster based on common themes among its members. ... The emphasis on common areas is a practical strategy; otherwise, comprehensively identifying the nature of a specialty can be too complex to handle manually.

Many researchers have studied the structural and dynamic properties of specialties in information science in terms of clusters, multivariate factors, and principle components (Morris & Van der Veer Martens, 2008; Persson, 1994; Tabah, 1999; White & Griffith, 1982).

A recent study of information science (Ibekwe-SanJuan, 2009) mapped the structure of information science at the term level using a text analysis system TermWatch and a network visualization system Pajek, but it did not address structural patterns of cited references.

Researchers also studied the structure of information science qualitatively, especially with direct inputs from domain experts. For example, Zins conducted a Critical Delphi study of information science, involving 57 leading information scientists from 16 countries (Zins, 2007a, 2007b, 2007c, 2007d).

Janssens, Leta, Glänzel, and De Moor (2006) studied the full-text of 938 publications in five library and information science journals with latent semantic analysis (LSA; Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990) and agglomerative clustering. They found an optimal 6-cluster solution in terms of a local maximum of the mean silhouette coefficients (Rousseeuw, 1987) and a stability diagram (Ben-Hur, Elisseeff, & Guyon, 2002). Their clusters were labeled with single-word terms selected by tf*idf (p. 1625), which are not as informative as multiword terms for cluster labels.

Klavans, Persson, and Boyack (2009) recently raised the question of the true number of specialties in information science. They suspected that the number is much more than the 11 or 12 as reported in ACA studies such as (White & McCain, 1998) and (Zhao & Strotmann, 2008a, 2008b), but significantly fewer than the 72 reported in their own study, which is also based on the 12 journals between 2001 and 2005.

The 12-journal Information Science dataset, retrieved from the Web of Science, contains 10,853 unique bibliographic records, written by 8,408 unique authors from 6,553 institutions and 89 countries. These articles cited 129,060 unique references for a total of 206,180 times. They cited 58,711 unique authors and 58,796 unique sources.

The traditional procedure of cocitation analysis for both DCA and ACA comprises the following steps:

1. Retrieve citation data from sources such as the Science Citation Index (SCI), Social Science Citation Index (SSCI), Scopus, and Google Scholar.

2. Construct a matrix of cocited references (DCA) or authors (ACA).

3. Represent the cocitation matrix as a node-and-link graph or as a multidimensional scaling (MDS) configuration with possible link pruning using Pathfinder network scaling or minimum spanning tree algorithms.

4. Identify specialties in terms of cocitation clusters, multivariate factors, principle components, or dimensions of a latent semantic space using a variety of algorithms for clustering, community finding, factor analysis, principle component analysis, or latent semantic indexing.

5. Interpret the nature of cocitation clusters.

The interpretation step is the weakest link. It is time-consuming and cognitively demanding, requiring a substantial level of domain knowledge and synthesizing skills. In addition, much of attention routinely focuses on cocitation clusters per se, but the role of citing articles that are responsible for the formation of such cocitation clusters may not be always investigated as an integral part of a specialty.

Our new method extends and enhances traditional cocitation methods in two ways: (a) by integrating structural and content analysis components sequentially into the new procedure and (b) by facilitating analytic tasks and interpretation with automatic cluster labeling and summarization functions. The new procedure is highlighted in yellow in Figure 2, including clustering, automatic labeling, summarization, and latent semantic models of the citing space (Deerwester et al., 1990).

Our new procedure adopts several structural and temporal metrics of cocitation networks and subsequently generated clusters.

Structural metrics include betweenness centrality, modularity, and silhouette.

Temporal and hybrid metrics include citation burstness and novelty

The betweenness centrality metric is defined for each node in a network. It measure the extent to which the node is in the middle of a path that connects other nodes in the network (Brandes, 2001; Freeman, 1977). High betweenness centrality values identify potentially revolutionary scientific publications (Chen, 2005) as well as gatekeepers in social networks.

In the context of this study, the modularity Q measures the extent to which a network can be divided into independent blocks, i.e., modules (Newman, 2006; Shibata, Kajikawa, Taked, & Matsushima, 2008).

The silhouette metric (Rousseeuw, 1987) is useful in estimating the uncertainty involved in identifying the nature of a cluster.

Burst detection determines whether a given frequency function has statistically significant fluctuations during a short time interval within the overall time period.

Sigma is introduced in (Chen, et al., 2009a) as a measure of scientific novelty. ... In this study, Sigma is defined as (centrality + 1)burstness such that the brokerage mechanism plays more prominent role than the rate of recognition by peers.

We adopt a hard clustering approach such that a cocitation network is partitioned to a number of nonoverlapping clusters.

In this article, cocitation similarities between items i and j are measured in terms of cosine coefficients.

A good partition of a network would group strongly connected nodes together and assign loosely connected ones to different clusters. This idea can be formulated as an optimization problem in terms of a cut function defined over a partition of a network. Technical details are given in relevant literature (Luxburg, 2006; Ng, Jordan, & Weiss, 2002; Shi & Malik, 2000).

Spectral clustering is an efficient and generic clustering method (Luxburg, 2006; Ng et al., 2002; Shi & Malik, 2000). It has roots in spectral graph theory. Spectral clustering algorithms identify clusters based on eigenvectors of Laplacian matrices derived from the original network.

Spectral clustering has several desirable features compared to traditional algorithms such as k-means and single linkage (Luxburg, 2006):
• It is more flexible and robust because it does not make any assumptions on the forms of the clusters,
• it makes use of standard linear algebra methods to solve clustering problems, and
• it is often more efficient than traditional clustering algorithms.

Candidates of cluster labels are selected from noun phrases and index terms of citing articles of each cluster. These term are ranked by three different algorithms. In particular, noun phrases are extracted from titles and abstracts of citing articles. The three term ranking algorithms are tf*idf (Salton, Yang, & Wong, 1975), log-likelihood ratio (LLR) tests (Dunning, 1993), and mutual information (MI).

Each cocitation cluster is summarized by a list of sentences selected from the abstracts of articles that cite at least one member of the cluster.

In this study, sentences are ranked by Enertex (Fernandez, SanJuan, & Torres-Moreno, 2007). Given a set S of N sentences, let M be the square matrix that for each pair of sentences gives the number of nominal words in common (nouns and adjectives).

In this study, summarization sentences were also ranked by two new functions gtf and gftidf , which are further simplified approximations of the energy function E.

The ACA and DCA studies described in this article were conducted using the CiteSpace system (Chen, 2004; Chen, 2006). CiteSpace is a freely available Java application for visualizing and analyzing emerging trends and changes in scientific literature.

CiteSpace supports a unique type of cocitation network analysis—progressive network analysis—based on a time slicing strategy and then synthesizing a series of individual network snapshots defined on consecutive time slices. Progressive network analysis particularly focuses on nodes that play critical roles in the evolution of a network over time. Such critical nodes are candidates of intellectual turning points.

In summary, (a) spectral clustering and factor analysis identified about the same number of specialties, but they appeared to reveal different aspects of cocitation structures and (b) cluster labels chosen from citers of a cluster tend to be more specific terms than those chosen by human experts.

We found the comparison with the study of Zhao and Strotmann very valuable. It offered us an opportunity to compare the analysis conducted by human experts to the interpretation cues provided by our automatic labeling and summarization methods.

Spectral clustering for the purpose of network decomposition is exclusive in nature although in reality it is often sensible to allow overlapping clusters because of multiple roles individual entities may play.

Spectral clustering of cocitation networks tends to generate distinct clusters with high precision, whereas human experts tend to aggregate entities into broadly defined clusters.

In conclusion, the new cocitation analysis procedure has the following advantages over the traditional one:
• It can be consistently used for both DCA and ACA.
• It uses more flexible and efficient spectral clustering to identify cocitation clusters.
• It characterizes clusters with candidate labels selected by multiple ranking algorithms from the citers of these clusters and reveals the nature of a cluster in terms of how it has been cited.
• It provides metrics such as modularity and silhouette as quality indicators of clustering to aid interpretation tasks.
• It provides integrated and interactive visualizations for exploratory analysis.

Modularity and silhouette metrics provide useful quality indicators of clustering and network decomposition.

2015年3月24日星期二

Yan, E. (2014). Research dynamics: Measuring the continuity and popularity of research topics. Journal of Informetrics, 8(1), 98-110.

由於發現新的物種、疾病與社交模式，產生新的研究主題與專業 (Li et al., 2010; Yan, Ding, Milojevic, & Sugimoto, 2012)，經過一段時間後，相關的研究社群會成長或是規模改變，有些主題仍然持續，但有些則是消失 (Griffiths & Steyvers, 2004; Upham & Small, 2010; Shi, Nallapati, Leskovec, McFarland, & Jurafsky, 2010)。已有許多研究利用書目資料來確認研究的專業，例如Kessler (1963)的論文書目耦合網路(bibliographic coupling networks)、Small (1973)的論文共被引網路(paper co-citation networks)、White 與 McCain (1998)的作者共被引網路(author co-citation networks)以及White (2003)的尋路網路 (pathfinder networks)，Callon、Courtial 與 Laville (1991)、Ding、Chowdhury 與 Foo (2000)、Milojevic、Sugimoto、Yan 與 Ding (2011)則是使用詞語共現網路 (co-word networks)。這些研究各自在不同研究層次確認研究主題，例如論文層次有 Chen (2004, 2006)、 Kessler (1963)和 Small (1973)，作者層次有 Clauset, Newman, & Moore (2004)、White & McCain (1998)和 White (2003)，期刊層次如 Glänzel & Schubert (2003)、 Leydesdorff & Vaughan (2006)，以及領域層次有 Janssens, Zhang, Moor, & Glänzel (2009)、Rafols & Leydesdorff (2009)、Zhang, Liu, Janssens, Liang, & Glänzel (2010)。較低的研究實體層級，如論文與作者，研究可以從領域內發現其他的主題或專業；但在期刊或領域等較高的層次，通常從更完整的資料中確認出次領域。確認主題的方法則有因素分析(factor analysis)和多維尺度(multidimensional scaling)等傳統的群集技術以及連結線中心性(edge betweenness)、群組性(modularity)和混合群集(hybrid clustering)等較新技術的應用。本研究(Yan, 2014)則是利用主題模型(topic model)確認研究主題，並提出主題延續性(topic continuity)及主題普遍性(topic popularity)等兩項動態特性來分析研究主題。應用主題模型技術考察主題動態的方法，包括事後分析(post hoc analysis)(例如： Griffiths & Steyvers, 2004; Hall, Jurafsky, & Manning, 2008)、分段法(segmented approaches) (例如：Bolelli, Ertekin,Zhou, & Giles, 2009)以及連續時間模型(continuous-time model) (Wang & McCallum, 2006)等。本研究採用事後分析，利用文件中各主題的機率分布評估主題存在的機率。

本研究針對每一年分別產生一個主題模型，對於每一個主題找出後一年最有可能的主題，評估兩個主題相似的方式是利用改良自Kullback-Leibler差異 (Kullback-Leibler divergence, KLD)的Jensen–Shannon差異 (Jensen–Shannon divergence, JSD)，兩個模型P和Q的JSD計算方式為JSD(P||Q) = 1/2KLD(P||M)+1/2KLD(Q||M)，KLD是兩個模型Kullback-Leibler差異，M=1/2(P+Q)。每一個主題後一年最有可能的主題是擁有最小JSD的主題，其分數JSD稱為JJSDS，運用JJSDS的變化趨勢計算主題的連續性，然後以z score進行標準化。
評估各主題的普遍性則是計算它們在該年度文件上平均的機率值，並以z score進行標準化，愈大的機率值表示該主題在當年度愈普遍，然後分析主題普遍性的變化趨勢，。

本論文的研究資料為2001到2011年的圖書資訊學(library and information science)出版品，包括期刊論文、書評及研討會論文等，採用論文的題名做為分析資料，共27,796 篇論文。每一年的主題數目都設為20。結果顯示在網路資訊檢索(web information retrieval)、引用及書目計量學(citation and bibliometrics)、系統及技術(system and technology)、健康科學(health science)等主題有較高的平均普遍性；h指標(h-index)、線上社群(online communities)、資料保存(data preservation)、社群媒體(social media)和網站分析(web analysis)等則是圖書資訊學裡愈來愈普遍的主題。研究結果的主題與過去的研究相符合，但這篇論文的貢獻在於對於研究主題的動態進行分析。

Dynamic development is an intrinsic characteristic of research topics. To study this, this paper proposes two sets of topic attributes to examine topic dynamic characteristics: topic continuity and topic popularity.

Topic continuity comprises six attributes: steady, concentrating, diluting, sporadic, transforming, and emerging topics; topic popularity comprises three attributes: rising, declining, and fluctuating topics.

These attributes are applied to a data set on library and information science publications during the past 11 years (2001–2011).

Results show that topics on “web information retrieval”, “citation and bibliometrics”, “system and technology”, and “health science” have the highest average popularity; topics on “h-index”, “online communities”, “data preservation”, “social media”, and “web analysis” are increasingly becoming popular in library and information science.

Dynamics is a constant theme in scientific explorations. Research communities may grow or change in size; new species, diseases, or societal patterns may be discovered; and new research topics and specialties may be introduced (Li et al., 2010; Yan, Ding, Milojevic, & Sugimoto, 2012). Over time, some topics are continuously investigated while others appear or disappear (Griffiths & Steyvers, 2004; Upham & Small, 2010; Shi, Nallapati, Leskovec, McFarland, & Jurafsky, 2010). Therefore, it is of great importance to examine research dynamics to understand the evolving cognitive structures of research domains.

Pioneering studies of paper bibliographic coupling networks (Kessler, 1963), paper co-citation networks (Small, 1973), author co-citation networks (White & McCain, 1998), pathfinder networks (White, 2003) and co-word networks (e.g., Callon, Courtial, & Laville, 1991; Ding, Chowdhury, & Foo, 2000; Milojevic, ´ Sugimoto, Yan, & Ding, 2011) were capable of identifying research specialties from bibliographic data effectively.

However, findings from these studies remained largely static and thus only yielded fixed perspectives on the cognitive structure of research domains.

To examine research dynamics,this study uses a topic modeling technique and proposes two sets of topic attributes–topic continuity and topic popularity.

• How to use topic modeling techniques to study research dynamics?

• What quantitative measurements can be used to describe topic dynamics?

• What topics are present in library and information science? What are their dynamic characteristics?

This subsection reviews the network-based approaches of identifying research topics and specialties. These approaches have been applied to several research levels, including the paper-level (e.g., Chen, 2004, 2006; Kessler, 1963; Small, 1973), the author-level (e.g., Clauset, Newman, & Moore, 2004; White & McCain, 1998; White, 2003), the journal-level (e.g., Glänzel & Schubert, 2003; Leydesdorff & Vaughan, 2006), and the field-level (e.g., Janssens, Zhang, Moor, & Glänzel, 2009; Rafols & Leydesdorff, 2009; Zhang, Liu, Janssens, Liang, & Glänzel, 2010).

Most above-mentioned work used co-occurrence networks as the research instrument.

Analyses on lower level research entities, such as papers and authors, usually identified topics and specialties from small but well-defined research fields; whereas analyses on higher level research entities, such as journals and fields, attempted to identify subfields and subdomains from more comprehensive data sets.

Both classic clustering techniques (e.g., factor analysis and multidimensional scaling) as well as modern techniques (e.g., edge betweenness, modularity, and hybrid clustering) have been applied.

Recently, studies have attempted to add dynamic analyses by utilizing multiple time intervals.

Several approaches on slicing time intervals are available: intervals that have the same amount of references (e.g., Radicchi et al., 2009), intervals that have the same number of publications (e.g., Sugimoto, Li, Russell, Finlay, & Ding, 2011; Yan & Sugimoto, 2011), same-length intervals (e.g., Åström, 2007; Milojevic´ et al., 2011), and accumulative intervals (e.g., Barabási et al., 2002; Yan & Ding, 2009).

These studies laid valuable methodological basis for dynamic analyses of cognitive structures of research fields; however, networks of different time frames were largely analyzed distinctively and a more integrated examination was lacking.

In the meantime, empirically, network-based clustering results may require domain expertise to effectively interpret obtained results.

Topic modeling techniques use probabilistic models to assign papers, journals, or authors to clusters. A topic can be defined as a probability distribution over terms in a vocabulary (Blei & Lafferty, 2007). Latent Dirichlet Allocation (LDA) model, a classic topic model, was proposed by Blei et al. (2003). The model predicates that words for each paper are derived from a mixture of topics and each topic follows a multinomial distribution.

One recent update of the LDA model is the supervised LDA model. It makes the analyses of multi-labeled corpora (e.g., tags from delicious.com and various classifications) possible. Blei and McAuliffe’s (2010) version of supervised LDA can successfully address this challenge, but a document can only be assigned with one label.

Ramage, Hall, Nallapati, and Manning (2009) offered an approach which enabled the multi-label assignment. Their supervised labeled LDA (L-LDA) associated one label with one topic and allowed the model to learn word-label relations.

Through topic modeling techniques, topic dynamics has been examined mainly through the following approaches: post hoc analysis (e.g., Griffiths & Steyvers, 2004; Hall, Jurafsky, & Manning, 2008), segmented approaches (e.g., Bolelli, Ertekin,Zhou, & Giles, 2009), and continuous-time model (Wang & McCallum, 2006).

Post hoc analysis uses topic-document probability distributions to evaluate the presence of identified topics.

Segmented approaches build the dynamic component in the probabilistic model. It assumes that the state of topics at a single time point is independent from all other time points and divides document corpora into segments that have contingent time stamps (Bolelli et al., 2009).

Continuous-time model is a non-Markov model proposed by Wang and McCallum (2006), where they found the non-Markov model provides better prediction and more interpretable topical trends.

In this study, a post hoc dynamic analysis using the ACT model is selected because of its marked performance (Tang et al., 2008) as well as its advanced input and output support.

Topic dynamics is calculated through the Author-Conference-Topic (ACT) model (Tang et al., 2008).

Specifically, i is the topic distribution for document i. Mean ( ¯), therefore, is a direct quantitative measurement to assess topic popularity: the higher the ¯, the more visible the topic, and thus the more popular that topic is (Griffiths & Steyvers, 2004).

Because the data set spans 11 years, 11 independent ACT models were run, one for each year of the data set based on year of publication.

The Jensen–Shannon divergence (JSD) was used as the similarity measurement to quantify the topic similarity between different word-topic distributions. ... JSD is a symmetrized and smoothed version of the Kullback–Leibler divergence (KLD). ... As a divergence measure, the smaller the JSD, the higher the similarity is.

In order to track the same topic from two adjacent time intervals, the minimum value for each row of a JSD matrix was used, referred to as the joint JSD score (JJSDS): MIN(JSD Matrix(i,j)), for j = 1:n. ... Applying the same approach to each pair of adjacent time slices, for each topic, an array of JJSDS can be obtained.

The attributes of steady, concentrating, and diluting topics focus on the overall topical characteristics whereas the attributes of sporadic, transforming, and emerging topics focus on the topical characteristics of a specified time frame. Therefore, these attributes are not mutual exclusive, suggesting that a topic can be a concentrating topic overall, and in the meantime, related topics were added and thus qualifying it for a transforming topic.

The data set contains publications of all journals indexed in the 2011 version of the Journal Citation Report in the Information Science & Library Science subject category. Articles, proceeding papers, and review articles published within these journals from 2001 to 2011 were downloaded for analysis (downloading time: October 2012). Stop words were then removed from publications’ titles. Publications without titles, authors, or journal names were removed from the data set. The final data set comprised 27,796 papers.

The number of topics is set at 20: this number considers the size of the paper corpus as well as previous empirical studies on the cognitive structure of library and information science (e.g., Milojevic´ et al., 2011; Sugimoto et al., 2011; White & McCain, 1998; Zhao & Strotmann, 2008). For reasons of consistency, the same number of topics was identified for each year of the data set.

In this subsection, we first present histograms made from values in Jensen–Shannon divergence (JSD) matrices (Fig. 4). These histograms provide a direct perception on how research topics in library and information science are related as measured by JSD. This subsection then introduces all 20 topics in each year from 2001 to 2011 as well as how topic continuity and popularity attributes are applied to these topics (Fig. 5).

Fig. 4 uses histograms to visualize JSD values for each pair of adjacent years. Because there are 20 topics for each year, the number of data points in each histogram is 400 (20 × 20). This number is 4000 for the histogram in the lower right section of Fig. 4, as it uses JSD values for all pairs of adjacent years.

This study finds that in library and information science, research topics on “web information retrieval”, “citation and bibliometrics”, “system and technology”, and “health science” have the highest average popularity over the past decade (from 2001 to 2011).

Research on “h-index”, “online communities”, “data preservation”, “social media”, and “web analysis” are increasingly becoming popular topics.

Overall, findings of this study are consistent with previous studies using co-word, co-citation, and topic modeling techniques.

For instance, a co-word study by Milojevic´ and colleagues (2011) has found that title terms “citation”, “impact factor”, and “web” have a rising usage from 1989 to 2008.

Other related dynamic studies that cover the target time frame of the current study (2001–2011) include Åström’s (2007) study on examining library and information science research front, where the study found that webometrics and information-seeking and retrieval have become dominating research areas between 2000 and 2004.

This finding has been verified by Klavans and Boyack (2011) where the authors used the global map (i.e., the map of science) to enhance to accuracy of local maps (i.e., the contextual map of information science). They identified five core areas in information science, including information-seeking behavior, computer-enhanced retrieval, scientometrics, co-citation analysis, and citation behavior.

Besides the contextual analysis of information science, structural analysis has also been achieved from a time-series empowered author co-citation and document co-citation analysis (Chen, Ibekwe-SanJuan, & Hou, 2010). Through the application of a series of structural metrics such as centrality measures, modularity and silhouette, a clear cognitive structure of information science was attained in that the research areas of interactive information retrieval, academic web, information retrieval, citation behavior, and h-index have gained a particular popularity from 1996 to 2008.

In addition to journal publications, Sugimoto and colleagues (2011) applied a LDA model to library and information science dissertations and demonstrated dissertations as an important communicative genre. Their study indicated that between 2000 and 2009, internet and information retrieval related topics were the central dissertation research themes.

The contribution of the current study is that it proposes two sets of quantitative topic attributes. These attributes have streamlined the dynamic analysis of research topics and specialties and have further complemented co-occurrence-based studies.

This paper has identified dynamic characteristics of topics in library and information science; however, limited information can be told about the mechanisms that resulted in such characteristics. That being said, the study is unable to pinpoint, for instance, whether the growing popularity of network and citation studies is the result of a growing research community, a drive by the commercial market, a stimulus from funding agencies, or a combination of these or other unlisted factors.

Popular topics may be associated with research communities that are expanding in size and/or tend to have higher productivity. Conversely, less popular topics may be associated with communities that are shrinking and/or have a reduced productivity. Topic continuity and popularity attributes reflect research specialties’ development in scientific communities, which is further guided by science policies and the attention of the general public.

In informetrics, studies have mainly focused on analyzing the performance and the social and cognitive implications of several types of research entities, including papers, authors, institutions, journals, and fields. Authors and institutions are typically used to examine social relations in academia; while journals and fields are predominantly used to investigate the cognitive structure of research domains.

Topic analysis can precisely provide a more refined assessment by clustering research papers based on certain probability distributions. Because of such quantitative results, a more integrated dynamic cognitive analysis is thus possible, as exemplified through the current study.

Topic analysis will be further developed by overlaying topics with author communities to explore the interwoven relationships between research topics and research communities (e.g., Yan et al., 2012); by overlaying topics with funding data to investigate the “lead-lag” relationship between funding support and productivity (e.g., Shi et al., 2010); by applying topic models to different genres to study research immediacy (e.g., Ding et al., 2013); and by overlaying topics with citation data to examine the relationships between topics and impact.

訂閱：文章 (Atom)

2015年3月30日 星期一

Chen, C., Ibekwe-SanJuan, F. and Hou, J. (2010), The structure and dynamics of cocitation clusters: A multiple-perspective cocitation analysis. Journal of the American Society for Information Science and Technology, 61 (7), 1386–1409. doi: 10.1002/asi.21309

2015年3月24日 星期二

Yan, E. (2014). Research dynamics: Measuring the continuity and popularity of research topics. Journal of Informetrics, 8(1), 98-110.

2015年3月30日星期一

2015年3月24日星期二