2013年11月29日 星期五

Zhang, H., Qiu, B., Giles, C. L., Foley, H. C., & Yen, J. (2007, May). An LDA-based community structure discovery approach for large-scale social networks. In Intelligence and Security Informatics, 2007 IEEE (pp. 200-207). IEEE.

Zhang, H., Qiu, B., Giles, C. L., Foley, H. C., & Yen, J. (2007, May). An LDA-based community structure discovery approach for large-scale social networks. In Intelligence and Security Informatics, 2007 IEEE (pp. 200-207). IEEE.

本研究以LDA (latent Dirichlet allocation) 演算法[17]做為社群發現(community discovery)的方法,也就是找出社會網路上在群體內高度相連但在群體間相對稀疏的行動者(actors)社群。在這個方法裡,社群被視為是LDA模型裡的隱藏變數,由所有行動者依不同比例組合成的混合(mixtures)。本研究認為這個方法的優點是只需要利用網路的型態資訊(topological information)便可進行社群發現;此外,有別於[2, 3, 5]等先前的方法,這個方法能夠描述行動者屬於多個社群的現象,並且在每個社群上有不同重要性。

本研究將所有與某個行動者互動的行動者以及他們之間的社會互動情形定義為這個行動者的社會互動特徵(social interaction profile, sip),例如行動者vi的社會互動特徵定義為

SIW(viwij)是行動者vi與另一位行動者wij間的互動情形,mi是與vi互動的行動者數目。
在利用LDA模型進行社群發現時,將每個行動者的社會互動特徵視為是原先模型中的文件,這些行動者發生的互動則是模型中的詞語,值得一提的是在社會網路上行動者間的互動順序是可以交換的(exchangeable),因此LDA模型相當適合這種情形。下圖是這個模型的圖示,


對行動者vi而言,其社會互動特徵sipi的產生過程(generative process)如下
1) Sample mixture components ϕk ~ Dir(β) for k (belongs) [1, K]
2) Choose θi ~ Dir(α)
3) Choose Ni ~ Poisson(ξ) (note that Poisson assumption is not critical to this model)
4) For each of the Ni social interactions wij :
(a) Choose a community ιij ~ Multinomial(θi);
(b) Choose a social interaction wij ~ Multinomial(ϕιij)
也就是
1) 從Dir(β)中產生行動者的組合ϕk來代表K個社群中每一個社群;
2) 從Dir(α)中選擇一個社群的組合θi,做為所有與vi互動的行動者所可能屬於的社群的分布情形;
3) 從Poisson(ξ)中選擇社會互動的個數Ni ;
4) 產生每一個社會互動 wij時:
a) 從Multinomial(θi)中選擇一個社群ιij
b) 從Multinomial(ϕιij)中選擇一個社會互動 wij

根據這個模型,行動者vi的社會互動特徵sipi中的第j個社會互動元素wijwm的機率是

此處θisipi 的混合比變數(mixing proportion variable),而ϕk是第k個社群成分分布(component distribution)的參數集合。

給定超參數(hyperparameter)αβ後,所有已知和隱藏變數的聯合機率(joint probability)為

由於要精確地推論LDA模型的變數一般而言是很困難的(intractable),因此經常以 變異期望值最大化(variational expectation maximization) [17], 期望值延遲 (expectation propagation) [24]和 Gibbs取樣 (Gibbs sampling) [25], [26], [22]等三種方法取得近似解。本研究採用的方法是Gibbs取樣,以Markov鏈 Monte Carlo模擬(Markov-chain Monte Carlo simulation)的方式,去逼近ϕk,wθm,k等變數


本研究針對CiteSeer和NanoSCI兩個書目資料集中的作者合著網路(co-authorship networks)上最大的成分進行社群發現,其中CiteSeer部分共包含249866個節點,而NanoSCI則有203762個節點。本研究並且提出了01-SIP、012-SIP和 k-SIP等三種描述行動者間的社會互動情形,01-SIP是描述兩個作者有直接合著的社會互動情形,如果兩個作者曾經合著至少一篇論文,便將他們的互動情形設為1,否則便設為0;012-SIP則考慮兩個作者沒有直接合著但曾經分別和第三位作者合著的情形,如果兩個作者曾經合著至少一篇論文,便將他們的互動情形設為2,雖然這兩位作者不曾合著,但曾有分別和第三位作者合著,便將互動情形設為2,否則便設為0;k-SIP則描述兩位作者多次合著的情形,k是他們合著的次數。研究結果發現在複雜度(perplexity)與各產生社群的緊密度(compactness of communities)等指標,012-SIP的成效都比其他兩者好,從社群發現的結果可以發現許多社群裡的作者是屬於相同的研究機構或是彼此具有相似的研究興趣,因此會共同研究並且合作發表而成為社群。

This paper describes an LDA(latent Dirichlet Allocation)-based hierarchical Bayesian algorithm, namely SSN-LDA(Simple Social Network LDA). In SSN-LDA, communities are modeled as latent variables in the graphical model and defined as distributions over the social actor space. The advantage of SSN-LDA is that it only requires topological information as input.

This model is evaluated on two research collaborative networks: CiteSeer and NanoSCI. The experimental results demonstrate that this approach is promising for discovering community structures in large-scale networks.

An important task in these emerging networks is community discovery, which is to identify subsets of networks such that connections within each subset are dense and connections among different subsets are relatively sparse.

Unlike those previous community discovery studies, we design a hierarchical Bayesian network based approach, namely SSN-LDA (Simple Social Network-LDA) to discover probabilistic communities from social networks. ... In this model, communities are modeled as latent variables and are considered as distributions on the entire social actor space.

We also propose three different approaches to create social interaction profiles based on the social interaction information in the network.

Girvan et al extended this measure to edges and designed a clustering algorithm which gradually remove the edges with highest betweenness value [5]. ... However, a major problem with this approach is that the complexity of this approach is O(m2n), where m is the number of edges in the graph and n is the number of vertices in the network.

The graph partition problem can be formulated as the balanced minimum cut problem where the goal is to find an optimal graph partition so that the edge weight between the partitions is minimized while maintaining partitions of a minimal size. The NP-complete complexity of this approach [15] requires approximate solutions. Flake et al developed approximate algorithms to partition the network by solving s-t maximum flow techniques [2], [3]. The main idea behind maximum flow is to create clusters that have small inter-cluster cuts and relatively large intra-cluster cuts.

The major difference between SSN-LDA approach and the aforementioned approaches is that SSN-LDA is a mixture-model based probabilistic approach. Each community weighs in the contributions from every social actors and this property can be exploited in many potential applications that will be introduced in Section VI.

LDA model was first introduced by Blei for modeling the generative process of a document corpus [17].

Among these variants of LDA models, the approaches proposed in [20], [10] are both concerned about the authors of the documents in the corpus. In particular, Zhou et al introduced a community latent variable in their graphical model and applied it to discover community information embedded in document corpus. This approach can discover the underlying social network based on social interactions and topical similarity.

In their follow-up work[23], Zhou et al. investigated how research topics evolve over time and attempted to discover the most influential researchers involved in such transitions.

Each actor is characterized by its social interaction profile (SIP), which is defined as a set of neighbor(wij) and the corresponding weight(SIW(vi, wij)) pair.

where mi is the size of vi's social interaction profile.

Note that we consider the social interaction elements in this profile are exchangeable and therefore their order will not be concerned. It is this exchangeability that permits the application of LDA model [17].

Subsequently, we specify that a social network contains a set of communities ι (ι1, ι2, …, ιk) and each community in ι is defined as a distribution on the social actor space. In SSN-LDA, community assignments are modeled as a latent variable (ι) in the graphical model. The community proportion variable (θ) is regulated by a Dirichlet distribution with a known parameter α. Meanwhile, each social actor belongs to every community with different probabilities and therefore its social interaction profiles can be represented as random mixtures over latent communities variables.

The SSN-LDA model for social network analysis is illustrated in Fig. 1. Note that SSN-LDA resembles topic-based LDA model[17], with the social network being analogous to the corpus, the social interaction profiles being analogous to documents; and the occurrence of social interactions being analogous to words.

The distribution of topics in documents and the terms over topics are two multinomial distributions with two Dirichlet priors, whose hyperparameters are α and β respectively.

The dimensionality K of the Dirichlet distribution, which is also the number of community component distributions, is assumed to be known and fixed.

This generative process for an agent(wi)'s social interaction profile sipi in a social network is:
1) Sample mixture components ϕk ~ Dir(β) for k (belongs) [1, K]
2) Choose θi ~ Dir(α)
3) Choose Ni ~ Poisson(ξ) (note that Poisson assumption is not critical to this model)
4) For each of the Ni social interactions wij :
(a) Choose a community ιij ~ Multinomial(θi);
(b) Choose a social interaction wij ~ Multinomial(ϕιij)

According to the model, the probability that the jth social interaction element wij in the social actor wi's social interaction profile sipi instantiates a particular neighboring agent wm is:

where θi is the mixing proportion variable for sipi and ϕk is the parameter set for the kth community component distribution.

Given the hyperparameters α and β, the joint distribution of all known and hidden variables is:


Exact inference is generally intractable for LDA model. There have been three major approaches for solving this model approximately, including variational expectation maximization [17], expectation propagation [24], and Gibbs sampling[25], [26], [22].

Gibbs sampling is a special case of Markov-chain Monte Carlo (MCMC) simulation[27] where the dimension K of the distribution are sampled alternately one at a time, conditioned on the values of all other dimensions[22].

we apply the Gibbs sampling algorithm that has been introduced in [26], [22] to solve the SSN-LDA model and reduce the computation requirement.



Subsequently, the update equation for the hidden variable can be derived [22]:


where n(.)i is the count that does not include the current assignment of ιi and recall that sip is the variable for social interaction profiles. For the sake of simplicity, we assume that the Dirichlet distribution is symmetric in deriving the above formula.

Finally, the update formula for ϕk,w and θm,k are as follows:



In co-authorship networks, the vertices represent researchers and the edges in the network represent the collaboration relation between researchers. In this section we evaluate SSNLDA model on co-authorship networks collected from two distinct areas: computer science(CiteSeer) and nanotechnology(NanoSCI). Note that no name disambiguation has been done on either dataset.

The size of the largest connected subnetwork of CiteSeer is 249866 while the size of the largest connected subnetwork in NanoSCI is 203762. In this paper, we are only interested in discovering community structures in the two largest subnetworks.

In this paper, we explore three different types of social interaction pro le representations for social networks, namely 01-SIP, 012-SIP, and k-SIP.

In the 01-SIP approach, an edge is drawn between a pair of scientists if they coauthored one or more articles. Collaborating multiple times does not make a difference in this model

In order to mitigate this problem, we propose a 012-SIP model which takes a node's neighbors' neighbors into consideration. ... In this model, we distinguish a node's direct neighbors from its neighbors' neighbors by giving different weights to them.

This section describes a K-SIP model where the weight information for an edge is defined as the times of the collaboration between the two authors.

And then, 10% of the original datasets is held out as test set and we run the Gibbs sampling process on the training set for i iteration. In particular, in generating the exemplary communities, we set the number of the communities as 50, the iteration times i as 1000. In perplexity computation, i is set as 300 in order to shorten the computation time. In both case, α is set as 1/K and β is set as 0.01, where K is the number of the communities.

Table III shows 6 exemplary communities from a 50-community solution for the CiteSeer dataset with social interaction profiles being created using 012-SIP representation. ...These exemplary communities give us some flavor on the communities that can be discovered by this approach. Specifically, we observe that some communities are “institution-based”, some others are “topic-based. ... This observation reveals the fact that researchers from same institution or with similar research interests tend to collaborate together more and build closer social ties.

Perplexity is is a common criterion for measuring the performance of statistical models in information theory. It indicates the uncertainty in predicting the occurrence of a particular social interaction given the parameter settings, and hence it reflects the ability of a model to generalize unseen data.

Perplexity PP is defined as

where wm is the social interaction profiles in the test set and

where n(v)m is the number of times term t has been observed in document m.

It shows that the perplexity value is high initially and decreases when the number of communities increases. In addition, the results show that the 012-SIP approach has lower perplexity value than the other two approaches.

Compactness of a community is measured through the average shortest distance among the top-ranked Nr researchers in this community. Short average distance indicates a compact community.

The t-test results show that the 012-SIP approach is significantly better the other two approaches for both datasets

The probabilities that can be derived from SSN-LDA model can be helpful in determining the importance and roles of community members. For instance, the importance of community members conditioned on the community variable ιj can be measured through the probability p(wi|ιj ), which can be easily derived from this model based on the learned ϕ.

SSN-LDA model provides an elegant way to measure the similarity of two communities by calculating the corresponding KL (Kullback-Leibler) distance and convert it to similarity measure. KL divergence is a distance measure for two distributions and the corresponding formula for calculating the distance between two communities ιi and ιj is:



This paper describes an LDA (latent Dirichlet Allocation)-based hierarchical Bayesian algorithm, namely SSN-LDA(Simple Social Network LDA). In SSN-LDA, communities are modeled as latent variables in the graphical models and defined as distributions over social actor space. The advantage of SSN-LDA is that it only requires topological information as input.

2013年11月23日 星期六

Cha, Y., & Cho, J. (2012, August). Social-network analysis using topic models. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (pp. 565-574). ACM.

Topic Model & LDA

本研究建議運用LDA(latent Dirichlet allocation)模型對社交網絡上圖式資料(graph)的節點與連結線進行分群(grouping)以及標示(labeling),藉以改善K-means [15] 及 DBSCAN [6]等常用方法無法對節點進行多重分群或是無法同時標註節點與連結線等問題。

過去在社交網絡上應用LDA模型的研究大多是根據作者-主題模型(author-topic model, AT model)[22]修改發展而來,著名的研究包括社群使用者主題(community user topic, CUT)模型[27]、作者-接收者-主題(author-recipient-topic, ART)模型與角色-作者-接收者-主題(role-author-recipient-topic, RART)模型[16]以及社群-作者-接收者-主題(community-author-recipient-topic, CART)模型[19]。此外,Zhang et al. [26]和Henderson et al. [9]的研究都是將LDA模型應用於學術社交網絡(academic social networks),Zhang et al. [26]的研究目的是探討將合著資訊(co-authorship)轉換為圖式資料的方法,Henderson et al. [9]的重點則是在處理圖式資料中大量的較低連結的節點。

圖式資料的分群問題如下圖所表示,F與G分別是連結線開頭與結尾的節點所成的集合,E則是連結兩邊節點的連結線所成的集合,Z代表的是連結線的標示以及節點的分群。這個研究要探討的問題是如何標示F與G的節點之間的連結線?並且如何根據連結線的標示結果將G內的節點分群?

如果某一節點fi連結到另一個節點gj,是因為figj中的某一個主題有共同的興趣。一個人可能會對多個主題感興趣,因此節點fi會連結到會來自不同主題的節點,就如同文件中會包含不同主題的詞語一樣,並且這些主題的分布為多元常態分布(multinormal distribution)。同樣的情形,每一個被連結到的節點可以出現在多個不同的主題上。另一方面,許多人會對某個特定主題有不同程度的興趣,因此每個主題內包含的節點也呈現多元常態分布,就像主題可以利用多種不同的詞語來表示,並且有些詞語會比其他的詞語相關。因此圖式資料的分群與標示問題可以將F與G內的節點分別類比成文件以及文件上出現的詞語,並且應用LDA模式的概念,利用得到的隱藏主題做為連結線的標示,並且對G內的節點的節點分群。

但是由於LDA本身是基於文件生成而發展的模型,在這個模型裡,出現在過多文件上的詞語通常比較無法代表特定的主題,被認為較不重要的詞語,因此往往會預先被過濾,不參與模型的訓練;然而在諸如社交網絡的圖式資料連結許多節點的節點則往往能夠代表多個較重要的主題,因此在將LDA模型應用於圖式資料的節點與連結線分群及標示應用時,應該納入這些連結較多的節點進行分析。本研究改良了LDA模型來解決上述的問題,使其可以運用於社交網絡平台上的朋友推薦(friend recommendations)功能。

本研究提出四種改良方案:
1) 以不對稱的先驗機率(a prior)取代對稱的先驗機率(a prior),使得連結較多的節點具有較高的先驗機率,例如使用下面的式子做為先驗機率的估計值。


2) 利用階層式LDA(hierarchical LDA)模型,使得較常出現的主題在上層,比較特定的主題則在下層。

3) 採用兩階段標示(two-step labeling)方式,第一階段先排除連結較多的節點以及連結到這些節點的連結線,只利用連結較少的節點資料建立主題,避免連結較多的節點的干擾,第二階段則利用第一階段建立好的主題,以下面的公式對連結較多的節點的連結線進行標示。

此處P(e(fg) = z|. )為將從 f 到g的連結線標示為某一個主題z的機率,NgzNfz分別表示gf被標示為z的次數。

4)對每一個節點設定如下面公式中的閾值(threshold),當節點被標示為某一主題的次數超過此一閾值時才採入計算,讓連結數量較多的節點可以僅指定幾個較有可能的主題。另一種方式則是節點被標示為某一主題的次數只採計前面幾個次數較多的主題,其餘的主題均不納入計算。


本研究從2009年十月到2010年一月間2億7千三百萬餘筆的Twitter資料中選取1千萬筆資料進行實驗,標示的主題個數為100,研究結果發現不管是複雜度(perplexity)或是分群的品質,後兩種方法(兩階段標示以及雜訊閾值過濾)都比其他的方法有較好的成效。

In this paper, we discuss how we can extend probabilistic topic models to analyze the relationship graph of popular social-network data, so that we can "group" or "label" the edges and nodes in the graph based on their topic similarity.

In particular, we first apply the well-known Latent Dirichlet Allocation (LDA) model and its existing variants to the graph-labeling task and argue that the existing models do not handle popular nodes (nodes with many incoming edges) in the graph very well. We then propose possible extensions to this model to deal with popular nodes.

Our proposed methods can be used for providing, for instance, more relevant friend recommendations within a social network.

In this paper, we explore techniques to provide more structure to this follow relationship (1) by "grouping" the users based on their topic interests, and (2) by "labeling" each follow relationship with the identified topic group.

Roughly, we can consider our goal as a clustering (or classification) problem, where many popular solutions such as K-means [15] and DBSCAN [6] exist. These existing methods, however, are not appropriate for our task because they either (1) associate each node with a single group (hard clustering) or (2) can associate each node with multiple groups (soft clustering), but require a completely separate method to label edges as well as nodes (since a node may be associated with multiple groups).

In this paper, we apply a well-known probabilistic topic model, called Latent Dirichlet Allocation (LDA), to the follow relationship graph of the social network, in order to label the nodes and the edges in the graph with (possibly) multiple topics. ...In particular, the direct application of LDA to our task requires that every node in the graph should be of roughly equal popularity and that we should remove nodes of high popularity from the dataset. This is particularly problematic because these popular nodes are really the ones that we want to label accurately; many users are particularly interested in identifying the topic groups of these popular users. Earlier work on the application of the LDA model to social graph [9, 26] has not addressed the handling of popular nodes.

Recently, Mislove et al. [18] identified communities in Facebook based on the social graph structure and inferred unknown attributes through this community information. The methods they used to cluster communities are based on pruning edges from the whole social graph and adding edges from some seed nodes, both of which are very common and widely used approaches in social-network analysis. However, these approaches produce mutually exclusive groups and cannot support multiple memberships, which is important in our scenario where users have a variety of interests.

Though not directly related to social-network analysis, the concept of author/user was initially introduced in the Author-Topic (AT) model [22]. It was used to extract hidden research topics and trends from CiteSeer's abstract corpus.

Zhou et al. [27] modified the AT model and proposed the Community User Topic (CUT) model to capture semantic communities.

McCallum et al. [16] extended the AT model and proposed the Author-Recipient-Topic (ART) model and the Role-Author-Recipient-Topic (RART) model in order to analyze the Enron e-mail corpus and an academic e-mail network.

Pathak et al. [19] modified the ART model and suggested the Community-Author-Recipient-Topic (CART) model similar to the RART model.

Besides these members of the AT model family, Wang et al. [24] introduced the Group Topic (GT) model and applied it to voting data from US Senate and the General Assembly of the UN.

Mei et al. [17] also introduced a regularized topic modeling framework incorporating a graph structure in the data.

Other LDA extensions and probabilistic topic models were also proposed for annotation data analysis [12], chat data analysis [23], tagging data analysis [8], and pairwise data analysis [2].

Perhaps the work by Zhang et al. [26] and by Henderson et al. [9] is closest to our work. In both works, the authors applied LDA to academic social networks. However, their respective focus was quite different from ours.

For example, [26] focused on the issue of how to convert the co-authorship information into a graph (e.g., direct co-authorship or indirect co-authorship, and edge weighting scheme based on collaboration frequency).

Henderson et al. [9] addressed the issue of a large number of topic clusters generated due to low popularity nodes in the network, while our primary focus is the effective clustering of high popularity nodes.

Because LDA evolved from Probabilistic Latent Semantic Indexing (PLSI) [11], we rst describe the concept of PLSI and then explain what di erentiates LDA from PLSI. ... PLSI introduced a probabilistic generative model to topic models. Equation (1) represents its document generation process based on the probabilistic generative model:

P(d, w) is the probability of observing a word w in a document d and can be decomposed into the multiplication of P(d), the probability distribution of documents, and P(w|d), the probability distribution of words given a document. This equation describes a word selection for a document, where we first select a document then a word in that document. If we iterate this selection multiple times, we can generate a document and eventually a whole document corpus.

By assuming that there is a latent topic z, we can rewrite the equation above with the multiplication of P(w|z), the probability distribution of words given a topic, and P(z|d), the probability distribution of topics given a document. This equation describes adding an additional topic selection step between the document selection step and the word selection step. As there are multiple latent topics where a word may come from, we sum the multiplication over a set of all the independent topics Z.

Though PLSI is equipped with a sound probabilistic generative model and a statistical inference method, it suffers from the overfitting problem and does not cope well with unobserved words.

To solve this problem, Blei et al. [4] introduced Dirichlet priors α and β to PLSI, to constrain P(z|d) and P(w|z), respectively.

α is a vector of dimension |Z|, the number of topics, and each element in α is a prior for a corresponding element in P(z|d). Thus, a higher αi implies that the topic zi appears more frequently than other topics in a corpus.

Similarly, β is a vector of dimension |W|, the number of words, and each element in β is a prior for a corresponding element in P(w|z). Thus, a higher βj implies that the word wj appears more frequently than other words in the corpus.

As a conjugate prior for the multinomial distribution, the Dirichlet distribution can also simplify the statistical inference. By placing Dirichlet priors α and β on the multinomial distributions P(z|d) and P(w|z), those multinomial distributions are smoothed by the amount of α and β and become safe from the overfitting problem of PLSI.

Thus, we use the term follow when a user adds another user as her friend. Formally, when a user f follows a user g, f generates a follow edge, or simply an edge, e(f; g) from a follower f to a followed user g. We also use e'(f; g) to denote an edge from g to f (indicating that g is followed by f), e(f) to denote the set of all outgoing edges from f, and e'(g) to denote the set of all incoming edges to g. To
refer to the set of all followers, the set of all followed users, and the set of all edges in the dataset we use F, G, and E, respectively.

Figure 1(a) depicts this notation using a graph that we refer to as a subscription graph. ... Given this subscription graph, our goal is to label each edge with a correct label (interest) and group (label) each followed user based on those labeled edges. Now we can frame our problem as the graph labeling problem of automatically associating each user gi in G with a set of accurate interests zk in Z based on its labeled incoming edges e'(gi). (We also label fj in F as well.)
Furthermore, we can consider a follower f to be a document d, a followed user g as a word w, and a list of followed users for the follower as the content of the document.

However, there is a subtle difference between the two generative processes. While words are sampled with replacement in the standard LDA, followed users are sampled without replacement in our model. For example, in a document generative model, a document may contain the word car multiple times. On the other hand, in our edge generative model, a user cannot follow the same user Barack Obama multiple times. As a result, the probabilistic distribution in our model does not follow a multinomial distribution but follows a multivariate hypergeometric distribution. Fortunately, the multinomial distribution can also be used for our model because it is known that a multivariate hypergeometric distribution converges to a multinomial distribution as the sample size grows large [1]. In our case, since sampling is done on millions of nodes, the two distributions become practically indistinguishable.

Also, when we represent E in matrix form by putting F in the rows and G in the columns as E (belongs) power(B, FXG), where B = {0,1}, some differing aspects are noticed:

As in a matrix formed from a document corpus, the distribution of the column sums show a power-law distribution, in which some small portion of the columns accounts for a majority of the total column document corpus, these columns correspond to words such as the and is, which we call stop words, and are generally removed. Those columns in our matrix, however, correspond to users such as Barack Obama and Britney Spears, which we can call popular users, and should be taken into account.

In a document corpus analysis, the stop words are generally removed before analysis, since an analysis without removing these stop words produces a very noisy result where the frequent stop words are labeled with every topic [4]. However, in our analysis, popular users are very important to include in the analysis because most users are keenly interested in following famous and influential users whose topic interests are similar to theirs. Unfortunately, it is not sufficient to simply include the popular users for LDA analysis, because the inclusion of popular users produces the same noisy result seen in the text analysis case: when stop words (or popular users) are included, they get included in every topic group, producing a very noisy result.

Though each element of vectors α and β may take different values in principle, in the standard LDA, each element of α and β is assumed to have the same value (often referred to as the symmetric prior assumption). Intuitively, this assumption implies that every topic and word in a document corpus is equally likely. ... As a higher prior value implies a higher likelihood of being observed in the corpus, we set each prior value proportional to the number of incoming edges of each followed user. It is expected to associate popular users with more accurate labels as they are given adequate prior values.
We set βgi , the prior for the followed user gi in the vector , as in Equation (3):

Hierarchical LDA (HLDA) is also a good candidate for our problem because it generates a topic hierarchy and more frequent topics are located at higher levels.

In HLDA, when a document is generated according to Equation (1), words are chosen from topics in a document path. Since the top level topic is associated with all the documents, common words in every document (i.e., stop words) are expected to be labeled with the top level topic. On the contrary, the bottom level topics are expected to be more specific as they are associated with a small number of documents.

This topic hierarchy is established because HLDA is based on the Nested Chinese Restaurant Process (NCRP), a tree extension to Chinese Restaurant Process, which probabilistically generates a partition of a set {1, 2, ......, n} at time n. In NCRP, a document is considered as a Chinese restaurant traveler who visits L restaurants along a restaurant tree path, where L refers to the level of the tree (i.e., the length of the path).

We decompose the labeling process into two sub-processes of establishing topics and labeling users with the established topics. In the first topic establishment step, we run LDA after removing popular users from the dataset similar to how we remove stop words before applying LDA to a document corpus. This step generates clean topics free from the noise generated by popular users. In the second labeling step, we apply LDA only to popular users in the dataset. As we use the collapsed Gibbs sampling algorithm [21], edges to popular users are labeled according to the pre-established topics as represented in Equation (4).

where P(e(fg) = z|. ) denotes the probability of labeling the edge from a follower f to the followed user g with a topic z, given all conditions, Ngz denotes the number of times g is labeled with z, and Nfz denotes the number of times f is labeled with z.

As briefly described in Section 3.1, the association between a user (a word) and a topic has varying association strengths represented by P(g|z) (P(w|z)) in a probabilistic topic model. Thus, we can list users labeled with the same topic in descending order of P(g|z) and regard the top entries in the list (topic group) as more important than the entries at the bottom because they are more strongly associated with the topic. Similarly, we can measure the association strength from the user's viewpoint using P(z|g). Though two-step labeling may help label popular users with the right topics, popular users may take top positions even in less-relevant topic groups because even the smallest number of times a popular user is assigned to a topic group could outnumber the largest number of times a non-popular user is assigned to that topic group.

To mitigate this problem, we propose a new approach, threshold noise filtering, which sets a cut-off value to determine whether to label a user with each topic. By ignoring assignments below the cut-off value, we can expect smoothing and noise reduction effects as in the anti-aliasing filter.

We set Ngizk, the number of times a followed user gi is assigned to a topic group zk, as 0 if it is below the cut-off value C:

As threshold noise filtering process is done after sampling, it does not increase computational complexity similar to two-step labeling.

Alternatively, we may filter out less relevant topics by keeping only the top-K topics for each user, for a reasonably small K value. We tested both schemes (threshold noise filtering and top-K filtering), but we couldn't see any practical differences.

Though threshold noise filtering can be used with any other approaches, we combine it with the two most representative cases, two-step labeling and the standard LDA in our experiments.

For our experiments, we use a Twitter dataset we collected between October 2009 and January 2010. The original downloaded dataset contained 273 million follow edges, but we sampled 10 million edges from this dataset to keep our experiment manageable. To ensure that all the follow edges were preserved in our sampled dataset, we first sampled followers randomly and included all follow edges from the sampled users, until we obtained 10 million follow edges.

From the graph, it is clear that the number of incoming edges follows a power-law distribution, which is often the case for this type of dataset. Interestingly, we observe that the number of outgoing edges is quite uniform between edge counts 1 to 100, which is different from the distribution reported in [14]. We do not believe this difference is due to our sampling, because the graph from the complete dataset shows the same at curve between the edge counts 1 and 100 [25].

Perplexity is a well-known standard metric used in IR. It tries to quantify the accuracy of a model by measuring how well the trained model deals with an unobserved test data. More precisely, perplexity is defined to be [26]:

where Etest denotes all the edges in the test dataset. We calculated Perplexity values for the 10% held-out dataset after training the models on the remaining 90% dataset. In general, lower perplexity means better generalizability.

Note that the original LDA model is designed to derive an optimal model that minimizes perplexity. Therefore, it is unlikely that extensions to base LDA show lower perplexity. Our primariy goal of comparing the results under the perplexity metric is to see whether our proposed extensions significantly degrade perplexity.

From the graphs, we first notice that HLDA (hlda-3lv and hlda-2lv) shows significantly worse Perplexity than others. This is due to the fact that the standard LDA is designed to minimize Perplexity while HLDA is not necessarily designed for this task.

Overall, we find that our two-step labeling and threshold noise filtering show similar perplexity values to the standard LDA model. That is, our extensions to LDA do not introduce noticeable Perplexity degradation to the standard LDA model.

To measure human-perceived quality, we conducted a survey with a total of 14 participants. The participants in our survey were presented with a random group of ten Twitter users (identified by one of the eight approaches described in Section 5.1) and were asked to indicate if each user in the topic group was relevant to the group or not. Overall, we collected 23 judged topic groups per each approach with a total of 161 judged topic groups.

Both of our extensions, two-step labeling and threshold noise filtering are very effective in improving the human-perceived quality of identified topic groups.

As two-step filtering initially forms clean topics free from popular user generated noise, its gain is more significant than that of threshold noise filtering. However, threshold noise filtering can be easily combined with any other approaches to improve Quality, as in filter-base.

If we apply the standard LDA to popular users as well without any pre-filtering, the produced topic groups are quite noisy.

Using an asymmetric Dirichlet prior for the hyperparameter β improves Quality, but the result is still noisier compared to our extensions.

The HLDA model shows only marginal improvements in the Quality metric when compared to base. This low performance is because most of the low-level groups in HLDA have users with very few followers (who are less likely to be interesting when recommended) and contribute only small weights to Quality.

Note that many users in this group are very popular and they are mainly about the same topic, tech media, indicating that 2step is able to group popular users into the right topic group in general. However, we observe that 2step still su ers from the presence of a few popular, yet less relevant users in the group (in this example, cnnbrk and breakingnews may be considered less relevant to the group than others)

With threshold noise filtering, we achieved less noisy results. Figure 12 shows a result topic group from filter-2step corresponding to the group in Figure 11 from 2step. We observe that cnnbrk and breakingnews, popular users on general media, are now removed from Figure 11 and more technology-centric media such as firefox, youtube, and engadget are added.

Even with this relatively limited type of data compared to those of previous approaches, our approaches generated very well-organized results and showed the great potential of applying LDA to these kinds of clustering applications. Our approaches are especially useful when only linkage data is available.

2013年11月21日 星期四

Henderson, K., & Eliassi-Rad, T. (2009, March). Applying latent Dirichlet allocation to group discovery in large graphs. In Proceedings of the 2009 ACM symposium on Applied Computing (pp. 1456-1461). ACM.

Henderson, K., & Eliassi-Rad, T. (2009, March). Applying latent Dirichlet allocation to group discovery in large graphs. In Proceedings of the 2009 ACM symposium on Applied Computing (pp. 1456-1461). ACM.

LDA-G(Latent Dirichlet Allocation for Graphs的簡稱)利用主題模型演算法(topic modeling algorithm)在實際的圖式資料(graph data,也就是網絡資料)發現隱藏群組結構(latent group structure)。

LDA主要應用於文本語料,這個方法假設在一個生成模型(generative model)中,語料庫內的每一文件由對於分布形式為多元常態分布(multinormal distribution)的多個主題重複取樣而得到,並且文件中每個詞語對應的主題在生成這個詞語時也是根據一個多元常態分布。整個生成模型可以簡化如下:

在推估後驗機率(posterior probability)時,可以如[4]所建議的方式,使用Gibbs sampling。

在應用LDA於圖式資料,假設節點間為有方向性的連結線(無方向性的連結線可以視為兩條有方向性的連結線),以連結出去的起始節點做為主題模型演算法中所謂的文件,被連結的結束節點則做為詞語。
由於文本語料和圖式資料有一些不同,因此需要對LDA進行修改。本研究列出文本語料和圖式資料的不同點,並且探討這些不同點對於在圖式資料中應用LDA的影響或是需要修改LDA的地方:
1. 文本語料中每個文件包含的詞語種類數都很多;然而根據冪次法則(power law),一般的圖式資料上大多數的節點擁有連出節點的數目都很少。這種情形在使用Gibbs sampling時將會產生為數相當多的群體,對於實際的圖式資料,這些多餘的群體並沒有顯著的有意義(significant meaning),因此在對圖式資料的型態(topology)沒有提供進一步的預測能力。
2. 許多詞語在文件中會重複出現,LDA也利用了這個特性;然而在一般的社交網絡上大多假設連結線為單一值。在實驗的情況中,連結線是否為多重值並沒有產生任何重大的影響,因此本研究建議當連結線為多重值時可以將它們轉換成單一值。
3. 詞語的順序與語意有某種程度的關連,在圖式資料上每個節點所連結的節點則無順序可言。然而LDA並沒有利用詞語的順序資訊,所以並沒有影響。

在運用LDA-G對圖式資料發現群組時可以使用有限的參數模型(a finite parametric model)或是無限的無參數模型(infinite nonparametric model),兩種模型的差異在於群組(也就是LDA的主題)數量的決定方式。有限的LDA-G模型由某些條件(例如連結的預測能力大小)來決定最佳的群組數量。無限的LDA-G模型則是將群組數量也視為要由Gibbs sampling演算法學習的變數之一。本研究使用的方式為有限模型。

本研究使用的研究資料是PubMed資料庫,針對三種圖式資料:1) Author x Knowledge: 是指從作者連結到知識主題(knowledge theme)的圖式資料,如果在資料庫中至少有一份文件,某一位作者u是它的作者,而且它的摘要內有包含知識主題k,則從u到k有一條連結線。在這個圖式資料上共有37,346位作者以及117個知識主題,並且共有119,443條連結線。 2) Author x Author: 是合著網絡(coauthorship network),如果資料庫裡有兩位作者合著的論文,他們的節點間便會有連結線。這個圖式資料有37,227位作者,並且共有143,364條連結線。3) Knowledge-Infused Author x Author: 這個圖式資料是在原先的Author x Author圖式資料加入有共同知識主題的作者之間的連結線,也就是如果兩位作者不曾合著過,但他們都曾撰寫某一個知識主題相關的文件達到十二次以上,他們之間便增加一條連結線。這個圖式資料擁有37,227個作者,但連結線增加339,644條。研究針對三種圖式資料測試LDA-G以及IRM (Infinite Relational Model)
和 Cross-association等群組發現方法,測試它們的連結線預測能力,以ROC曲線(ROC curve)下的面積做為比較的標準。結果發現平均而言,LDA-G的連結線預測能力比IRM好15%,並且比Cross-association好25%。

This paper introduces LDA-G, a scalable Bayesian approach to finding latent group structures in large real-world graph data. ... LDA-G (short for Latent Dirichlet Allocation for Graphs) utilizes a well-known topic modeling algorithm to find latent group structure. Specifically, we modify Latent Dirichlet Allocation (LDA) to operate on graph data instead of text corpora. Our modifications reflect the differences between real-world graph data and text corpora (e.g., a node’s neighbor count vs. a document’s word count).

In our empirical study, we apply LDA-G to several large graphs (with thousands of nodes) from PubMed (a scientific publication repository). We compare LDA-G’s quantitative performance on link prediction with two existing approaches: one Bayesian (namely, Infinite Relational Model ) and one non-Bayesian (namely, Cross-association). On average, LDA-G outperforms IRM by 15% and Cross-association by 25% (in terms of area under the ROC curve).

A canonical form of topic modeling is the latent Dirichlet allocation (LDA) [2]. LDA is a hierarchical nonparametric Bayesian approach to topic discovery in text corpora. It assumes a generative model in which each document di from a corpus samples (latent) topics from a multinomial distribution Mult(θ), and each topic zi samples words in the vocabulary from a multinomial distribution p(w|zi).

Gibbs sampling, such as in [4], can be used to estimate the posterior probability on configurations of the model.

A given configuration is an assignment z =< z1z2, ..., zn >, where each entry corresponds to the topic of a given word in the corpus.
The full generative model for a simplified version of LDA is as follows:

The runtime complexity of this inference approach is O(NKM).
For LDA, N is the number of documents, K is the number of topics, and M is the average length of documents.
For LDA-G, N is the number of nodes, K is the number of groups, and M is the average node degree in the graph.
The space complexity for both is O(N(K +M)).

There are three primary differences between real-world graphs (e.g., social networks, citation networks, etc) and text corpora.

The first difference is the distribution of node degrees vs. distribution of words. Text corpora usually have at least tens or hundreds of words in even the shortest documents. Social networks tend to have power-law distributions [9], with the modal degree being 1 and the mean degree being very low.
The degree distribution, however, does have an effect on performance. In particular, when there are hundreds of thousands of low-degree nodes in a network, the Gibbs sampler for LDA tends to generate thousands or tens of thousands of groups. The reason for this can be readily seen by looking at the sampling equations in [4]. This is a problem for group discovery, since these extra “topics” or groups have no significant meaning for real-world graphs, and do not provide additional predictive capacity on the topology of the graph.

The second major difference is that many real-world graphs, especially social networks, are not represented as multigraphs. Clearly, most documents of interest have some words that appear multiple times; and LDA takes advantage of this when modeling the topics.
The multigraph aspect is a valid consideration; but in our experiments (see Section 5), it does not have any major effect on performance. In fact, when we analyzed a real-world multigraph using LDA-G, it generated a more predictive model when all multiple edges were transformed into single edges. Thus, the multigraph discrepancy can be safely ignored.

Finally, text documents are semantically sensitive to word ordering, whereas in social networks, no order is assigned to the edges of a given vertex.
Word order is not part of the LDA model, so it can be disregarded in our adaptation.

To perform group discovery on graphs, LDA-G can use either a finite parametric model or an infinite nonparametric model.

In the finite case, the model is truncated at some number of groups, K, which is chosen on a per-application basis. There are several considerations which can affect the proper choice of K. For our purposes, we use link prediction on a tuning set as the main criterion.

The infinite LDA-G model learns the appropriate model size through yet another hyperparameter and alterations to the Gibbs sampler update.

In our experiments (see Section 5), we find finite LDA-G with model selection to be superior to the full nonparametric version.

Author x Knowledge Graph: This is a bipartite graph with author nodes connected to knowledge theme nodes. A link exists from an author u to a knowledge theme k for every paper in which u is a coauthor and k is a theme appearing in the abstract. This graph is a multigraph, but as noted above we summarize any multiple edges as single edges. This graph contains 37,346 authors and 117 knowledge themes, with 119,443 author-knowledge edges (see Figure 2).

Author x Author Graph: This graph is a coauthorship network, two author nodes are connected if they published a paper together. This symmetric graph has a clique for each publication. It contains 37,227 authors and 143,364 edges (see Figure 2). Note that there are some authors who appear in this graph but not in the Author x Knowledge Graph and vice versa.

Knowledge-Infused Author x Author Graph: This graph is constructed by infusing the Author x Author Graph with information from the Author x Knowledge Graph. We generate this graph as follows. First, we prune the Author x Knowledge multigraph by removing links that appear less than 12 times. We pick the threshold of 12 based on graphsize considerations. This step effectively removes noise in the Author x Knowledge Graph. Second, in the Author x Author Graph, we add a link between any pair of authors that share a knowledge theme in the pruned Author x Knowledge graph. This produces a denser “knowledge-infused” Author x Author Graph with 37,227 authors and 339,644 edges (see Figure 2).

In particular, we use stratified random sampling to hold-out some links from the input graph. The remaining links are used to discover the latent groups. Then, the superiority of these groups is checked based on how well they predict the existence of the held-out links.

For LDA-G, it is the average over all sampled configurations of Σt∈topics p(t|vi)p(vj |t).

In all experiments, we construct the ROC curve on a test set of unobserved links (randomly selected from the graph data) and report the average Area Under the ROC Curve (AUC) over 5 independent trials.
This paper describes LDA-G, a scalable and expressive approach to this problem. LDA-G modifies a popular topic modeling algorithm for graph data. Comparative experiments with IRM (an alternative Bayesian approach) and Cross-association (a compression-based approach) illustrate the superiority of LDA-G in terms of both quantitative and qualitative results.

2013年11月20日 星期三

Sugimoto, C. R., Li, D., Russell, T. G., Finlay, S. C., & Ding, Y. (2011). The shifting sands of disciplinary development: analyzing North American Library and Information Science dissertations using latent Dirichlet allocation. Journal of the American Society for Information Science and Technology, 62(1), 185-204.

Sugimoto, C. R., Li, D., Russell, T. G., Finlay, S. C., & Ding, Y. (2011). The shifting sands of disciplinary development: analyzing North American Library and Information Science dissertations using latent Dirichlet allocation. Journal of the American Society for Information Science and Technology, 62(1), 185-204.

本研究利用北美在1930到2009年間完成的3121筆博士論文,探討圖書資訊學(library and information science, LIS)主要研究主題的變化情形。過去在研究LIS領域的主題時,內容分析(content analysis)和共被引分析(cocitation analysis)是主要的研究方法,針對一個時期內的期刊論文進行分析,發現該時期主要的研究主題。重要的內容分析研究例如 Enger, Quirk, & Stewart (1989)、 Järvelin & Vakkari (1990, 1993)、Kumpulainen (1991)、 Hider & Pimm (2005)和  Fidel (2008);以期刊進行書目計量分析的研究有 Åström (2007, 2010)的兩篇論文;以論文作者進行書目計量分析的研究則包括Pettigrew & Nicholls (1994)、White & McCain (1998)、 Bates (1998)、 Budd (2000)、White (2001)、Levitt & Thelwall (2009a, 2009b)和Åström (2010)。目前大多數的研究有下列的限制:1)雖然大多數的研究都以期刊論文為研究資源,但已有許多研究(Bazerman, 1988; Hyland, 2000)指出不同文類(genres)的書寫與引用模式皆有不同,只有針對一種文類進行分析便很有可能只會產生單一觀點;2)除了以期刊論文為主要的研究資料外,大多數的研究並且以少數高被引的作者或隨機選取的論文做為代表性的資源,但由於分析的資料數量不大,容易受到少數的影響,使產生的結果可能無法代表全體的情形;3)目前的研究大多是針對一個時期的同時性(synchronic)研究,缺乏以趨勢變化分析為主的歷時性(diachronic)研究,少數的歷時性研究有Smeaton, Keogh, Gurrin, McDonald, & Sødring (2003)針對SIGIR研討會論文探討資訊檢索領域在25年間的變化,Sugimoto & McCain (2010)同樣探討資訊檢索領域的主題發展情形,Harter & Hooten (1992) 將1972-1990年分為三個時期研究the Journal of the American Society for Information Science & Technology articles上的論文與作者在書目計量上的變化,Järvelin and Vakkari (1993)則是分別對三個時期的期刊論文進行內容分析, Åström’s (2007) 將 LIS領域相關的期刊論文分為三個時期進行共被引分析。除了這些限制以外,上述研究的另一個問題是這種方法間接透過論文與作者進行分析,並非直接針對主題進行研究。先前Braam, Moed,&van Raan (1991a, 1991b)的研究嘗試利用詞語的共同出現做為分析的資訊來探討領域內的重要研究主題,然而Leydesdorff (1997)認為在不同的文件累積範圍內,單一詞語的出現有相當不同的意義,因此這樣的分析有待商榷。

本研究的研究方法是以LDA(latent Dirichlet allocation) (Blei, Ng, & Jordan, 2003)來確認各時期的隱藏主題,並且確認各個主體的代表論文。LDA是一種文件生成模型,目前已被廣泛地應用於主題的確認與分析,例如Rosen-Zvi, Grifftihs, Steyvers,&Smyth (2004)曾運用LDA探討作者和主題之間的關係;Tang, Jin, & Zhang (2008)延伸這樣的概念到學術網路(academic networks)上;McCallum,Wang, & Corrada-Emmanuel (2007)和Li et al., (2010)則分別運用LDA於社交網絡(social networks)和社會標籤社群(social tagging communities)。Blei & Lafferty (2007)則運用LDA了解各主題彼此間的相關性(correlations),Pruteanu-Malinici, Ren, Paisley,Wang, & Carin(2010)和Rzeszutek, Androutsos, & Kyan (2010)的研究關注在主題在時間上的變化。為了了解LDA進行主題確認的可行性,Griffiths & Steyvers (2004)和Zheng, McLean, & Lu (2006)將LDA產生的主題與現有的論文資料分類進行比較,也證明了這種方法的確可行。 LDA將文件表示成由隱藏的主題隨機混合而成,也就是每個文件包含多個詞語,並且每一個主題由詞語依不同比例分布。文件上的所有詞語是從這個文件對應的主題組合中依據它們的比例重複地隨機抽取而得,配合主題對應的詞語比例選出。本研究使用的LDA方法是由Rosen-Zvi, Grifftihs, Steyvers,&Smyth (2004)所擴充的作者-主題模型(author-topic models),在這個模型中不僅允許一個文件可以包含多個主題,它描述了一個文件可能具有多位作者,並且每位作者也可以針對多個主題書寫的情形。下圖是作者-主題模型的Bayesian網路模型示意圖:

1. 對某一位作者x來說,Θ 是選擇某一個主題z的機率分布。
2. 對某一個主題z來說,φ 是選擇某一個詞語w的機率分布。
3. ad 是文件的多位作者,x是從這些位作者當中隨機選取的一位。

在估算上述LDA模型裡的給定某一位作者後選擇某一個主題的機率(Θ)以及給定某一個主題後選擇某一個詞語的機率(φ)等兩個未知參數時可以使用Gibbs取樣演算法(Gibbs sampling algorithm),這個演算法使用連續Markov鏈取樣(successive Markov chain sampling),根據所有其他變數,重複抽取一對的作者(x)與主題(z),以實際上所得到的數值來估算

此處,nw[m][j]是第m個詞語被指定給第j個主題的次數,nkwsum[j]是所有詞語被指定給第j個主題的次數總和,V是詞彙的大小,也就是語料內共有多少種詞語種類的數量,na[x][j]是第j個主題被指定給作者x的次數,naksum[x]是所有主題被指定給作者x的次數總和,T是主題的數量。
根據上面的過程,φ與Θ的估算方式如下


在實際研究上,可以根據複雜度(perplexity)測量模型的成效來選擇主題的數量,愈小的複雜度表示模型的成效愈好。

本研究將1930-2009年區分為五個時期,每一個時期設定為50個主題,選擇該時期機率較大的五個主題視為是該時期的主要研究主題,並且對每個主題選出機率較大的詞語來了解該主題。研究發現如下圖

從開始時期(1930-1960)到現在(2000-2009)LIS的主題有本質上的改變,然而也有圖書館史(library history)、引用分析(citation analysis)與資訊檢索(information retrieval)等出現在多個時期內,這些可視為LIS的核心主題。

This work identifies changes in dominant topics in library and information science (LIS) over time, by analyzing the 3,121 doctoral dissertations completed between 1930 and 2009 at North American Library and Information Science programs.

The authors utilize latent Dirichlet allocation (LDA) to identify latent topics diachronically and to identify representative dissertations of those topics.

The findings indicate that the main topics in LIS have changed substantially from those in the initial period (1930–1969) to the present (2000–2009). However, some themes occurred in multiple periods, representing core areas of the field: library history occurred in the first two periods; citation analysis in the second and third periods; and information-seeking behavior in the fourth and last period.

Many evaluations of library and information science (LIS) have been conducted, primarily using the methods of content analysis and cocitation analysis on journal articles (e.g., Järvelin & Vakkari, 1993; White & McCain, 1998).

Although these studies constitute one lens on the field, there are some major limitations to the current literature in the area.
First, the focus on a single communicative genre (the journal article) provides a monocular view of the field. Research has shown that the writing and citing patterns of authors vary significantly by genre (Bazerman, 1988; Hyland, 2000). A different topic spectrum may be found by examining topics across multiple genres.
Second, the focus has been on either a group of highly cited authors or a sample of journal articles. Previous analyses have been manually intensive, necessitating small sample sizes. This has the potential to skew the results in two ways: (a) highly cited works are not necessarily representative of the works produced, and (b) a few articles/authors can heavily influence the results.
Lastly, the analyses have been largely synchronic, rather than diachronic. Therefore, trend data rely on replication studies, which are not prevalent in the literature.

Many quantitative analyses have been conducted to analyze the domain of LIS: content analyses of journal articles (e.g., Enger, Quirk, & Stewart, 1989; Fidel, 2008; Hider & Pimm, 2005; Järvelin & Vakkari, 1990, 1993; Kumpulainen, 1991), bibliometric analyses of journal articles (e.g., Åström, 2007, 2010), and bibliometric analyses of authors (e.g., Åström, 2010; Bates, 1998; Budd, 2000; Pettigrew & Nicholls, 1994; Levitt & Thelwall, 2009a, 2009b; White, 2001; White & McCain, 1998) to provide large-scale descriptions of the field.

Some analyses have focused on particular journals (e.g., Harter & Hooten, 1999; Lipetz, 1999; Liu, 2002; Park, 2010), conference proceedings (e.g., Smeaton, Keogh, Gurrin, McDonald, & Sødring, 2003), subject areas (e.g., Sugimoto & McCain, 2010), or countries (e.g., Cano, 1999; Uzun, 2002).

Scholars have also performed bibliometric analyses to examine the relationship between LIS and other disciplines (e.g., Borgman & Rice, 1992; Ellis, Allen, &Wilson, 1999; Meyer & Spencer, 1996; Odell & Gabbard, 2008; Sugimoto, Pratt, & Hauser, 2008).

The majority of quantitative analyses of LIS share four things: (a) the journal article is the focal communicative genre, (b) they are synchronic, rather than diachronic, (c) they focus on relationships between journals and/or journal authors (rather than topic analysis), and (d) those focusing on topic analysis use methods of co-occurrence or content analysis.

Some notable exceptions (on at least one of the four points) are the works by Smeaton et al. (2003) and Sugimoto and McCain (2010), which looked at changes in topics in information retrieval over time; Harter and Hooten’s (1992) bibliometric study of the Journal of the American Society for Information Science & Technology articles for three time slices; Åström’s (2007) cocitation analysis of LIS journal articles for three periods; and Järvelin and Vakkari’s (1993) content analysis of journal articles for three periods.

Some work has been done to examine the value of various methods of topic analysis, comparing the results found through cocitation with co-word analysis (e.g., Braam, Moed,&van Raan, 1991a, 1991b) and the difference between using titles, author-supplied, or indexer-supplied keywords (Whittaker, Courtial, & Law, 1989). ... Scholars have also criticized the use of co-occurrence analysis of terms, noting the large variance in meanings of individual terms based on the level of textual aggregation under investigation (Leydesdorff, 1997).

However, the manual intensity of content analysis becomes difficult, as the number of dissertations in the discipline has gone from a few hundred to a few thousand. Although content analysis can provide high granularity for individual works, it becomes difficult to assess the entire body of work without automatic techniques.

Latent Dirichlet allocation was proposed by Blei, Ng, and Jordan (2003) as a generative probabilistic model useful for discovering underlying topics in collections of data.

Expansions of LDA have also been used to understand correlations between topics (Blei & Lafferty, 2007), authors (Rosen-Zvi, Grifftihs, Steyvers,&Smyth, 2004), academic networks (Tang, Jin, & Zhang, 2008), social networks (McCallum,Wang, & Corrada-Emmanuel, 2007), social tagging communities (Li et al., 2010), and changes in topic over time (Pruteanu-Malinici, Ren, Paisley,Wang, & Carin, 2010; Rzeszutek, Androutsos, & Kyan, 2010).

Two exceptions to this are Griffiths and Steyvers’ (2004) analysis of abstracts from the Proceedings of the National Academy of Science (PNAS) from 1991–2001 and Zheng, McLean, and Lu’s (2006) analysis of the bioinformatics literature (from MEDLINE abstracts). These studies found that LDA performed well, in that the latent structure mimicked some characteristics of explicit structure (such as categorization schemes). Moreover, the studies displayed the ability of LDA to analyze the rich underlying structures of the domain— depicting emerging and sustained trends in a given discourse.

In LDA, a topic is characterized by a distribution over words and then documents are represented as random mixtures over latent topics (p. 996). As a three-level hierarchical Bayesian model, each topic node is sampled repeatedly. The result is that words may be repeated within topics and documents may be associated with more than one topic.

This model was extended to what is called the author–topic model (Rosen-Zvi et al., 2004), which is used in the present analysis. In this model, not only is each document a mixture of probabilistic topics, but each author is also seen as a mixture of probabilistic topics. In the same way that a topic can be generated from multiple topics, this extended model recognizes that an author can be “about” multiple topics (within a single document or across documents).

The author–topic model allows us to not only examine which topics were most salient across the various periods, but also which authors are most associated with these topics.



The figure can be explained as follows:
1. Θ is the probability of a topic given an author x; α is a hyperparameter for Θ.
2. φ is the probability of a word w given a topic z; β is a hyperparameter for φ
3. ad provides for the fact that multiple authors can write a single document; x is a randomly selected author from ad. (Note that there are only single authors in this selection; however, it is still necessary to identify author x.)
4. Given author x, we identify the topic z most likely to be associated with the given author.
5. Given topic z, we identify the words w most likely to be associated with the given topic.

Given the Bayesian network created from Figure 1, the joint probability of the author–topic pair was estimated using Gibbs sampling algorithm (Casella, 2001). This algorithm allows for an estimation of the unknown parameters of the model, namely, the probability of a topic given an author and the probability of a word given a topic. Gibbs sampling uses a successive Markov chain sampling to repeatedly draw x and z (as a pair), conditioned on all other variables. The process can be expressed as follows:



where nw[m][j] is the number of times a single word is assigned to a topic; nkwsum[j] is the number of times any word is assigned to a topic; na[x][j] is the number of times a single author is assigned to a topic; naksum[x] is the number of times any topic is assigned to an author.

Once this process has been iterated 2,000 times, the results of these variables are used to calculate Θ and φ.


Perplexity analysis is used to estimate the performance of the model. This analysis is used when we have an unknown probability distribution in the data. A lower perplexity value indicates better performance. As shown in Figure 2, the performance for our data stabilized at 50 topics.

At this point, all data were analyzed according to the author–topic model, divided into the five time slices. For each period, 50 topics were identified. Each topic contained a probability value—that is, the likelihood that the topic identified should be associated with the period. These topics were ranked by probability values and the top five were selected as being most representative of the period.

Similarly, a probability for each word was calculated to represent the association between a word and the given topic. These were ranked by probability values and the top 20 were chosen as most representative of the topic.

Lastly, the authors were assigned probability values for each topic and these too were ranked. The top five were chosen as highly representative authors for the given topic.

The results of the topic analyses are summarized in Figure 3, with the topics ranked from highest (top) to lowest for each period.



Some topics occur across multiple periods, including library history, librarianship, information use, citation analysis, classification, information retrieval (abbreviated as IR in Figure 3), and information-seeking behavior. These are the core topics in LIS dissertations from 1930 to 2009.

A limitation inherent with LDA analysis is in the manual interpretation and labeling of “topics.” Although some topics were fairly straightforward to label (e.g., Topic 5c, the top three loading words of which were (a) information, (b) seeking, and (c) behavior), others proved more difficult to ascertain the content or methodological relationship that connected the words and dissertations.

The top three topics by period identified in Järvelin and Vakkari’s (1993) content analysis of LIS journal articles is displayed in Table 7. These topics share some similarity with the present findings: classification is certainly a top theme in the first period, but there is an overwhelming emphasis on history in this period that is not represented by Järvelin and Vakkari’s results. ... The discrepancies between these findings lead to questions relative to genre and the possible impact of genre upon the topics. Further analysis should explore whether certain topics appear first in particular communicative genres and whether genres consistently emphasize different topics within a single field.



In terms of the highest loading specialties, this work confirmed an interest in information retrieval, bibliometrics, and information use consistent with the findings of White and McCain. However, the current analysis places a stronger emphasis on library evaluation, management, and education than was the case with White and McCain’s study. Although White and McCain did not undertake a detailed analysis on the changes in topics over time, they provided evidence for a shift toward the cognitive and user side of information retrieval.

Åström (2007) similarly evaluated LIS using cocitation analysis. He identified the main clusters by period as shown in Table 8.



For example, numerous studies summarized above noted the lack of theoretical work published in the LIS literature. It may be the case the journal articles are more likely to cover experimental work, whereas dissertations tend toward the theoretical.

However, the findings suggest that the bulk of dissertations completed at these schools do not have explicit connection to library practice.

Houser (1982) states that a “discipline is formed to solve a range of problems about some natural or social phenomena. . . [t]hese problems have a genealogy, that is, a continuity which forms the domain of the discipline” (p. 97).

Turner’s (2000) definition of a discipline focused on the identity of a shared name for a specialization and the exchange of scholars within that discipline (thereby propagating the discipline).

In examining dissertations, we were able to identify some dominant themes in LIS: These can be broadly defined as information-seeking, use, access, organization, and retrieval; and the education and training of the professionals providing these services.

2013年11月16日 星期六

Pirolli, P. (2008). A Probabilistic Model of Semantics in Social Information Foraging. In AAAI Spring Symposium: Social Information Processing (pp. 72-77).

Pirolli, P. (2008). A Probabilistic Model of Semantics in Social Information Foraging. In AAAI Spring Symposium: Social Information Processing (pp. 72-77).

本研究利用主題模型(topic model),對於社交媒體Lostpedia上的發文,分析主題,以視覺化的方式呈現主題間的關係,並探討與社交資訊的搜尋與意義建構(social information foraging and sensemaking)在時間上的演變能否反應現實的事件。本研究共分析18,238 筆Loatpedia上的頁面,詞語的種類共有34,352種,將每頁面上各種詞語的出現次數建立成矩陣,其中的809,444為非0值 。以Gibbs sampling技術訓練主題模型,將主題數T設為20,訓練次數為500次,另外兩個重要參數α與分別設為T/50與200/W。在得到訓練結果後,以每個主題上機率最高的10種詞語來代表該主題,並且以KL差異測量(KL divergence measure)評估主題間的相似程度,將測量結果輸入多維尺度(multidimensional scaling)分析產生二維圖形,藉以觀察主題間的關係。另外,本研究測量每周最為活躍的詞語,藉由這些詞語找出被感興趣主題的變化起伏。

Over the past decade, the Web and the Internet have evolved to become much more social, including being much more about socially mediated information foraging and sensemaking. This is evident in the growth of blogs [10], wikis [20], and social tagging systems [16]. It is also evident in the growth of academic interest in socially mediated intelligence and the cooperative production of knowledge [2, 18, 19].

With that motivation, this paper presents extensions of Information Foraging Theory [14] aimed at modeling the acquisition of semantic knowledge about some topic area at the social level. The model is an attempt to start to address the question: how do the semantic representations of a community engaged in social information foraging and sensemaking evolve over time?

The Topic Model has been mainly directed at modeling the gist of words and the latent topics that occur in collections of documents. It seems especially well suited to modeling the communal production of content, such as scientific literatures [8], or wikis, as will be illustrated below.

The Topic Model assumes there is an inferred latent structure, L, that represents the gist of a set of words, g, as a probability distribution over T topics. Each topic is a distribution over words. A document is an example of a set of words. The generative process is one in which a document (a set of words) is assumed to be generated as a statistical process that selects the gist as a distribution of topics contained in the document, and then selects words from this topic distribution and the distribution of words within topics. This can be specified as

where g is the gist distribution for a document, wi is the ith word in the document, selected conditionally on the topic zi selected for the ith word conditional on the gist distribution. In essence, P(z|g) reflects the prevalence of topics within a document and P(w|z) reflects the importance of words within a topic.

This generative model is an instance of a class of three-level hierarchical Bayesian models known as Latent Dirichlet Allocation [4] or LDA. The probabilities of an LDA model can be estimated with the Gibbs sampling technique in Griffiths and Steyvers [8].

Topics are represented as probability distributions over words, and document gist is represented as probability distributions over topics.



In the current modeling effort, there were two simple aims: (a) understanding the semantic topics that underlie a community wiki, and (b) understanding the changes in production of topic knowledge over time in reaction to events in the world.

The latent topics in this communal system surely change over time as new evidence and mysteries are presented in the world (mainly via the Lost program, but also via other media and information releases by writers and producers).

A Topic Model was estimated from a Lostpedia database containing page editing information. The database covered the period from September 21, 2005 to December 13, 2006. That period covers Season 2 and part of Season 3 of Lost.
Over that period, 18,238 pages were created and 160,204 page revisions were performed. The mean number of revisions per page was 8.78 revisions/page and the median was 2 revision/page.
The word activity table contains D = 18,238 pages and W = 34,352 distinct words.
The resulting word-document co-occurrence matrix of 18,238 pages and 34,352 words contained 809,444 entries.

LDA was performed using Gibbs sampling (see the Appendix) to produce T = 20 topics. Exploration of other values of T (both larger and smaller) were generally consistent with the T = 20 model, but this number of topics is about the right size to discuss in a paper. The sampling algorithm was run for 500 iterations with parameters α = T/50 and β = 200/W (see Appendix).

In the results of this estimation, each topic is represented as an estimated probability distribution over words. One way to understand each topic is to examine the most probable words associated with each topic.

Since the topics are represented as probabilities, the inter-topic similarities (or dissimilarities) can be defined using a KL divergence measure (see Appendix) and submitted to a multidimensional scaling (MDS) analysis that can be used to plot the topics as points in a lower-dimensional space such that the spatial distance among points reflects (as best possible) the dissimilarities among topics.

One way to assess the value of the Topic Model is to see what kinds of trends it detects in attention to topics over time, and to see how the model assessments relate to what we know about releases of information in the world of Lost.

The strength of a topic was determined as the proportion of weekly word activity assigned to that topic compared to all topics.

The Topic Model has been applied to the analysis of scientific literatures and used to make sense of trends in the scientific literature. It appears that the Topic Model can be used to analyze how a socially mediated sense making site can be similarly analyzed to reveal the waxing and waning of interest in semantic topics in reaction to events in the world.

The Topic Model was applied to the Lostpedia wiki and found to reveal a sensible set of semantic topics and tracked interest in those topics in reaction to events in Lost.

P(w|z) is represented as a set ϕ of T multinomial distributions over the W words such that

where zi is the topic associated with the word wi in document di and the jth topic is represented as a multinomial distribution with parameters ϕ(j).

P(z) is represented as a set θ of D multinomial distributions over the T topics, such that for each document there is a multinomial distribution with parameter θ(di), such that for every word wi in document di the topic zi of the word is



The Dirichlet distribution is the Bayesian conjugate for the multinomial distribution and provides Dirichlet priors on the parameters ϕ(j) and θ(di).


The Topic Model provides estimates of T multinomial distribution parameters, one for each topic such that the probability of any word w conditional on a topic j is defined as



The Topic Model also estimates a set of document probability distributions for over topics for each of the document d.