Monday, April 8, 2013

Yan, E., Ding, Y., Milojević, S., & Sugimoto, C. R. (2012). Topics in dynamic research communities: An exploratory study for the field of information retrieval. Journal of Informetrics, 6(1), 140-153.

information visualization

This study integrates two algorithms, community detection (Clauset, Newman, & Moore, 2004) and topic identification (Tang et al., 2008), to explore the interwoven and co-evolving interactions between research communities and research topics. A considerable amount of prior work has examined research communities and topic identification separately: the former includes studies that apply community detection to coauthorship networks to uncover patterns of community interactions (Girvan & Newman, 2002; Richardson, Mucha, & Porter, 2009; Pepe & Rodriguez, 2010), while the latter includes work that uses probability distributions over words as statistical models of the topics in a research field (Blei, Ng, & Jordan, 2003). On integrating the two directions, Racherla and Hu (2010) examined whether coauthors collaborate on a single topic or on varied topics, and Steyvers, Smyth, Rosen-Zvi, and Griffiths (2004) modeled each author as a probability distribution over topics in their Author-Topic model. Building on the Author-Topic model, McCallum, Corrada-Emmanuel, and Wang (2004) proposed the Author-Recipient-Topic model, in which each topic is modeled as a multinomial distribution over words and each author-recipient pair as a distribution over topics. Tang et al. (2008) further extended the Author-Topic model into the Author-Conference-Topic (ACT) model, which uses probabilistic models to describe documents' contents, authors' interests, and conferences (journals) simultaneously.
Building on the studies above, this paper takes the field of information retrieval (IR) from 2001 to 2007 as its case. The papers published during this period are divided into three time slices (2001–2003, 2004–2005, 2006–2007); the major research communities and topics of each slice are identified, and the relationships among topics, as well as between each community and the topics, are examined. The community detection algorithm of Clauset et al. (2004), a widely used modularity-maximization method, is applied to the largest connected component of the coauthorship network in each time slice, and the ten largest communities are extracted for each slice. Topics are identified with the ACT model of Tang et al. (2008), so that each author has a probability distribution over the topics. Ten major topics are identified for each time slice; comparing each topic with the other topics of the same slice using cosine similarity over their word distributions shows that the similarities are all low, indicating that the ACT model distinguishes topics well. Comparing topics across time slices reveals succession relationships: a topic and its successors have relatively high similarity. The results indicate that more popular topics tend to have multiple successors, while less popular topics have only one successor or none at all. When communities are compared with topics, larger communities, which contain more authors and therefore more diverse research interests, rarely show a concentrated research topic, whereas smaller communities have more focused topics. The study also shows that authors tend to collaborate with others who have similar expertise and publish papers on similar topics. Finally, it points out that some communities devote themselves to rather unique topics that other communities seldom touch, such as biomedical-related topics.

This paper examines how research topics are mixed and matched in evolving research communities, using a hybrid approach that integrates both topic identification and community detection techniques. Using a data set of information retrieval (IR) publications, two layers of enriched information are constructed and contrasted: one is the communities detected through the topology of the coauthorship network, and the other is the topics of these communities detected through the topic model.

The complexity of scholarly data has led to a growing interest in applying probabilistic models to identify topics from documents. A topic represents an underlying semantic theme; it can be informally approximated as an organization of words and formally operationalized as a probability distribution over terms in a vocabulary (Blei & Lafferty, 2007).
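
As a minimal formalization of that definition (my notation, not the paper's), a topic k over a vocabulary of V terms is simply a probability vector over those terms:

    \beta_k = \big( p(w_1 \mid z = k),\; \ldots,\; p(w_V \mid z = k) \big), \qquad \sum_{v=1}^{V} p(w_v \mid z = k) = 1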

Topic models are the latest advancement in this vein of research (e.g. Blei, Ng, & Jordan, 2003). Topic models provide useful descriptive statistics for a collection of scholarly data, thus making it easier for scholars to navigate academic documents. The outcomes of topic models are probability distributions of words or publications for each topic (e.g. Blei et al., 2003); however, they provide no information on which community contributes to a certain topic or how topics are developed by communities.

Research communities can be detected using community detection methods to group actors, such as authors and journals, with the goal of identifying patterns of community interactions.

Leskovec, Lang, Dasgupta, and Mahoney (2008) used the concept of “conductance” to capture a community: a good community should have small conductance, i.e. “it should have many internal edges and few edges pointing to the rest of the network” (p. 4).
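
As a rough illustration of that definition, the following toy Python/networkx sketch (my own example, not code from the paper) computes conductance as the number of edges leaving a community divided by the smaller of the two volumes:

    import networkx as nx

    def conductance(G, community):
        """Edges leaving the community divided by the smaller of the two volumes."""
        S = set(community)
        cut = sum(1 for u, v in G.edges() if (u in S) != (v in S))
        vol_S = sum(d for _, d in G.degree(S))
        vol_rest = sum(d for _, d in G.degree()) - vol_S
        return cut / min(vol_S, vol_rest)

    # A graph with two dense clusters joined by a single bridge edge.
    G = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
                  ("d", "e"), ("e", "f"), ("d", "f")])
    print(conductance(G, {"a", "b", "c"}))  # low conductance -> good community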

A decisive advance in community detection was made by Newman and Girvan (2004), who introduced a quantitative measure for the quality of partitioned communities, known as modularity.

In reality, communities and topics are not disconnected; on the contrary, communities and topics are interwoven and co-evolving: that is, a research community can carry several topics, and a topic can consist of different collaboration groups (Li et al., 2010a). Therefore, in order to study the interdisciplinary nature of science, it is necessary to integrate the two threads of research on community detection and topic identification, and utilize them to understand the dynamic interactions between topics and communities.

Racherla and Hu (2010) constructed a topic similarity matrix by assigning a predefined research topic to each document and its authors, and using authors’ collaboration information to link topics. They found that authors not only collaborate on the same research topics but also collaborate on varied research topics.

Upham, Rosenkopf, and Ungar (2010) developed an iterative clustering scheme that produces high-quality dynamic clusters over time. Using such an approach, twenty-one research communities were detected in the information science and technology area. Innovation performance was then quantified by various parameters and measured for each of these clusters.

Pepe and Rodriguez (2010) conducted an in-depth study of a small collaboration network of researchers in the area of sensor networks and wireless technologies. They adopted the notion of discrete assortativity coefficient to evaluate the collaboration pattern in this network. They found that its collaboration has become more intra-institutional and more inter-disciplinary.

Built upon previous endeavors on graph partitioning, Girvan and Newman (2002) proposed an algorithm that uses edge betweenness to identify the boundaries of communities. They applied the method to a scientific collaboration network at the Santa Fe Institute, and identified several densely connected communities. They found that scientists are grouped together either by a similar research topic or by a similar research methodology, where the latter situation may be an indication of interdisciplinary work.

The Girvan–Newman algorithm is computationally demanding and was later optimized into a more efficient algorithm (Clauset, Newman, & Moore, 2004). The new algorithm incorporated modularity, which has since become a standard measure for evaluating community structures. For instance, Richardson, Mucha, and Porter (2009) found that their spectral graph-partitioning algorithm can yield higher-modularity partitions. They applied their method to a coauthorship network of network scientists and found three well-known research centers in network science. However, from their findings it is unclear whether the three locations also form three distinct research topics or how the research centers are connected via topics.
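
For readers who want to try this, the Clauset–Newman–Moore greedy algorithm and the modularity score are available in networkx; the sketch below runs them on a made-up weighted coauthorship toy graph (illustrative only, not the paper's data):

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities, modularity

    # Toy coauthorship network; edge weights stand for numbers of coauthored papers.
    G = nx.Graph()
    G.add_weighted_edges_from([
        ("A", "B", 3), ("B", "C", 1), ("A", "C", 2),   # one dense cluster
        ("D", "E", 2), ("E", "F", 1), ("D", "F", 2),   # another dense cluster
        ("C", "D", 1),                                  # weak bridge between them
    ])

    communities = greedy_modularity_communities(G, weight="weight")
    print([set(c) for c in communities])
    print("modularity:", modularity(G, communities, weight="weight"))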

Similar to the methods used in detecting author communities, scholars working on identifying topics have used methods such as multidimensional scaling (e.g. White & McCain, 1998), k-means (e.g. Yan, Ding, & Jacob, in press), modularity-based clustering techniques (e.g. Van Eck & Waltman, 2010), and hybrid approaches (e.g. Janssens, Glänzel, & De Moor, 2008).

Upham and Small (2010), for instance, gave a good quantitative definition of growing, shrinking, stable, emerging, and exiting research fronts.

Traditionally, the research instruments they utilize are mainly co-occurrence networks, for instance, author co-citation networks (White & McCain, 1998), document co-citation networks (Klavans & Boyack, 2011; Small, 1973; Upham & Small, 2010), journal co-citation networks (Ding, Chowdhury, & Foo, 2000a), or co-word relations (Ding, Chowdhury, & Foo, 2000b; Milojevic, Sugimoto, Yan, & Ding, 2011).

Boyack and Klavans (2010) examined several types of scholarly networks, including a cocitation network, a bibliographic coupling network, and a citation network, in the interest of selecting the network that can represent the research front in biomedicine. They used within-cluster textual coherence and grant-to-article linkage indexed by MEDLINE as accuracy measurements and found that the bibliographic coupling-based citation-text hybrid approach, an approach that couples both references and words from title/abstract, outperformed other approaches.

Janssens, Glänzel, and De Moor (2007, 2008) proposed a novel hybrid approach that integrates two types of information, citation (in the form of a term-by-document matrix) and text (in the form of a cited references-by-document matrix). Noticing that the weighted linear combinations may “neglect different distributional characteristics of various data sources” (p. 612), the authors developed a new approach named Fisher’s inverse chi-square method. This method can effectively combine matrices with different distributional characteristics. They found the hybrid approach outperformed the text-only approaches by successfully assigning papers into correct clusters.
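
For intuition only: the core of Fisher's inverse chi-square method is that p-values from independent sources can be combined through -2 Σ ln p, which follows a chi-square distribution. The toy sketch below (made-up numbers and a generic scipy call; not the paper's full matrix-integration pipeline) shows that combination step:

    from scipy.stats import combine_pvalues

    # Hypothetical p-values for one document pair: one from the text layer,
    # one from the citation layer (values invented for illustration).
    p_text, p_citation = 0.04, 0.20

    stat, p_combined = combine_pvalues([p_text, p_citation], method="fisher")
    print(stat, p_combined)  # chi-square statistic and the combined p-value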

Following the tradition of data mining and knowledge discovery, topic models have gained great popularity among computer scientists in recent years. One well-known topic model is the Probabilistic Latent Semantic Indexing (pLSI) model proposed by Hofmann (1999). Built on pLSI, Blei et al. (2003) introduced a three-level Bayesian network called Latent Dirichlet Allocation (LDA). In topic models, topics are modeled as probability distributions over terms in a vocabulary.
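
A minimal, self-contained sketch of fitting an LDA-style topic model (my toy example with scikit-learn and four invented snippets, not the paper's setup); normalizing each row of components_ gives the per-topic word distributions described above:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "query expansion for web search ranking",
        "relevance feedback improves ranking of results",
        "gene expression data mining for protein databases",
        "clinical text mining in biomedical health records",
    ]

    tf = CountVectorizer(stop_words="english")
    X = tf.fit_transform(docs)

    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(X)   # documents as mixtures of topics
    vocab = tf.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [vocab[i] for i in weights.argsort()[-5:][::-1]]
        print(f"topic {k}: {top}")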

Steyvers, Smyth, Rosen-Zvi, and Griffiths (2004) proposed an unsupervised learning technique for extracting both the topics and authors of documents. In their Author-Topic model, authors are modeled as probability distributions over topics.
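
Assuming gensim's AuthorTopicModel interface, the sketch below illustrates the Author-Topic idea on hypothetical authors and documents (the names, mapping, and parameters are all invented; this is not the original implementation from Steyvers et al.):

    from gensim.corpora import Dictionary
    from gensim.models import AuthorTopicModel

    docs = [
        ["query", "expansion", "web", "ranking"],
        ["relevance", "feedback", "ranking", "evaluation"],
        ["gene", "expression", "protein", "mining"],
    ]
    # Hypothetical author-to-document mapping (document indices per author).
    author2doc = {"alice": [0, 1], "bob": [1], "carol": [2]}

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    model = AuthorTopicModel(corpus=corpus, num_topics=2, id2word=dictionary,
                             author2doc=author2doc, passes=50, random_state=0)
    print(model.get_author_topics("alice"))  # the author as a distribution over topics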

McCallum, Corrada-Emmanuel, and Wang (2004) presented the Author-Recipient-Topic (ART) model, a directed graphical model which conditions the per-message topic distribution jointly on both the author and individual recipients. In the ART model, each topic is modeled as a multinomial distribution over words, and each author–recipient pair is modeled as a distribution over topics.

The Author-Conference-Topic (ACT) model, proposed by Tang et al. (2008), further extended the Author-Topic model to include conference/journal information. The ACT model uses probabilistic models to represent documents' contents, authors' interests, and conference/journal information simultaneously.

To understand the interaction between research communities and research topics, there is a need to incorporate both community detection and topic modeling approaches. For instance, Zhou, Ji, Zha, and Lee Giles (2006) and Zhou, Manavoglu, Li, Lee Giles, and Zha (2006) proposed two generative Bayesian models for semantic community detection in social networks by combining probabilistic modeling with community detection algorithms. Their method was able to detect the communities of individuals and meanwhile provide topic descriptions to these communities.

Information retrieval (IR) was chosen as the target domain. Papers were collected from Scopus for 2001–2007 (inclusive)... Time slices were set as 2001–2003, 2004–2005, and 2006–2007 so that each slice has a similar number of authors, thus providing comparable networks. Authors in the largest component (LC) were finally selected to form the coauthorship networks (Table 1).
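
A small sketch of the largest-component step (toy edge list and networkx; the real data are the Scopus records described above):

    import networkx as nx

    # Toy coauthorship edges for one time slice: (author, author, papers together).
    G = nx.Graph()
    G.add_weighted_edges_from([("A", "B", 2), ("B", "C", 1),
                               ("D", "E", 1), ("E", "F", 3), ("D", "F", 1)])

    # Keep only the authors in the largest connected component (LC).
    largest_cc = max(nx.connected_components(G), key=len)
    G_lc = G.subgraph(largest_cc).copy()
    print(G_lc.number_of_nodes(), G_lc.number_of_edges())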

Clauset et al.'s (2004) method was applied to the coauthorship network of each time period. The modularity for weighted networks can be calculated as (Clauset et al., 2004): ... Formula (1) is the fraction of within-community edges minus the expected value of that fraction if the edges were placed at random between vertices with the same degrees.
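
For reference, the standard weighted form of this measure (Newman & Girvan, 2004; Clauset et al., 2004) can be written as

    Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where A_{ij} is the weight of the edge between vertices i and j, k_i = \sum_j A_{ij} is the weighted degree of vertex i, m = \frac{1}{2} \sum_{ij} A_{ij}, and \delta(c_i, c_j) equals 1 when i and j are assigned to the same community and 0 otherwise.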

An extended stop word list is used to exclude words that are common throughout IR, such as "information", "retrieval", "system", "search", and "model". The ACT model (Tang et al., 2008) was used to detect topics. In the ACT model, each author is associated with a multinomial distribution over topics, and each word in a paper, as well as the paper's conference stamp, is generated from a sampled topic.
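
The paper does not name a specific toolkit; as one possible way to realize the stop-word step, the sketch below extends a generic English stop list with the five IR-ubiquitous words mentioned above before building the term counts:

    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

    # Words that appear in almost every IR paper and would otherwise dominate every topic.
    ir_stop_words = {"information", "retrieval", "system", "search", "model"}
    extended_stop_words = list(ENGLISH_STOP_WORDS.union(ir_stop_words))

    vectorizer = CountVectorizer(stop_words=extended_stop_words)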

In this way, the posterior distribution of topics depends on three modalities: authors, words, and conferences (or journals). The model begins with the joint probability of the whole data set, and then using the chain rule, the posterior probability of sampling the topic and author for each word can be obtained.

The next step was to overlay research topics on the detected communities. The procedure was as follows: (1) search and collect publications for all authors in the top ten communities in each time slice; (2) apply the ACT model to the publications of each time slice with the number of topics set at ten; (3) generate a topic-author distribution (P(topic | author)) using the ACT model, in which each author obtains a topic distribution vector (for author i: ai = (t1, t2, . . . , t10)), and set a threshold, replacing probabilities below the average of 0.1 (1/10) with 0 so that insignificant probabilities are not counted and do not add noise to the community similarity calculation; (4) extract and average the topic distributions of the authors in a community, take the mean as the community's topic distribution vector, and normalize the vector so that it sums to one; (5) calculate cosine similarities between communities.
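
A compact sketch of steps (3)–(5) above, using numpy on made-up author-topic matrices (the thresholding, averaging, normalization, and cosine comparison follow the description; the data are random placeholders):

    import numpy as np

    def community_topic_vector(author_topic_rows, num_topics=10, threshold=None):
        """Zero out probabilities below 1/num_topics, average the authors' rows,
        and normalize so the community vector sums to one."""
        if threshold is None:
            threshold = 1.0 / num_topics
        rows = np.where(author_topic_rows < threshold, 0.0, author_topic_rows)
        vec = rows.mean(axis=0)
        return vec / vec.sum()

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Two hypothetical communities; each row is one author's P(topic | author).
    rng = np.random.default_rng(0)
    comm_a = rng.dirichlet(np.ones(10), size=4)
    comm_b = rng.dirichlet(np.ones(10), size=3)
    print(cosine(community_topic_vector(comm_a), community_topic_vector(comm_b)))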

An increasing number of words have been added to the knowledge domain of IR over time: from 3785 in 2001–2003 to 9794 in 2006–2007, indicating an expanded research scope of IR scholars. Around half of the words used in the earlier period are inherited by the next period—the other half is abandoned. In addition, 10% of the words in 2001–2003 were not mentioned in 2004–2005 but regained attention in 2006–2007.

For topics belonging to the same time period (the three blocks located on the diagonal line), most topics have low similarities with other topics. It is a good sign in that the ACT model has successfully identified distinguishable topics.

For topics belonging to different time periods, it can be found that some topics have evident successors (bright squares) while other topics fail to proceed into the next time period (dark squares). In addition, topics with high popularity tend to have multiple successors, while topics with low popularity tend to have only one successor or none.

Topic popularity is predicated on the ACT model. The underlying assumption is that if the words belonging to a certain topic occur more frequently, then this topic has high popularity. Since ten topics are set, a topic popularity of 0.1 means this topic has an average popularity. A value above 0.1 suggests a “hot” topic and a value below 0.1 suggests a “cold” topic.

Topics with high popularity are well connected: the top five topics in each time period have predecessors and/or successors. However, topics with low popularity are loosely connected, suggesting that they did not receive continuous attention.

Communities of smaller sizes tend to have evident topical concentrations, which is understandable as communities of larger sizes are more likely to involve scholars with diverse research interests. For example, in 2001–2003, Community 7 is specialized in Topic 10, Community 9 is specialized in Topic 7, and Community 10 is specialized in Topic 6; comparatively, the top three communities in 2004–2005 and 2006–2007 did not yield evident topical concentrations.

In regard to topics, most topics are associated with at least one community. For example, Topic 9 in 2004–2005 is studied by Community 4, and Topic 4 in 2006–2007 is studied by Community 10. The results indicate that authors are more inclined to collaborate with others who have similar expertise and publish papers on similar topics. In addition, smaller communities tend to have relatively distinct research topics.

A few communities (Communities 7, 9, and 10 in 2001–2003; Communities 4 and 10 in 2004–2005; Communities 6 and 10 in 2006–2007) concentrate on relatively unique topics and have a lower level of topical similarity with other communities, especially for biomedical-related topics: for example, Community 4 in 2004–2005 is highly specialized in "database-protein-gene-expression-mining", Community 10 in 2004–2005 in "medical-health-clinic-systematic-biomedical", and Community 10 in 2006–2007 in "application-memory-optical-remote-imaging".
