Topic Model & LDA
本研究建議運用LDA(latent Dirichlet allocation)模型對社交網絡上圖式資料(graph)的節點與連結線進行分群(grouping)以及標示(labeling),藉以改善K-means [15] 及 DBSCAN [6]等常用方法無法對節點進行多重分群或是無法同時標註節點與連結線等問題。
過去在社交網絡上應用LDA模型的研究大多是根據作者-主題模型(author-topic model, AT model)[22]修改發展而來,著名的研究包括社群使用者主題(community user topic, CUT)模型[27]、作者-接收者-主題(author-recipient-topic, ART)模型與角色-作者-接收者-主題(role-author-recipient-topic, RART)模型[16]以及社群-作者-接收者-主題(community-author-recipient-topic, CART)模型[19]。此外,Zhang et al. [26]和Henderson et al. [9]的研究都是將LDA模型應用於學術社交網絡(academic social networks),Zhang et al. [26]的研究目的是探討將合著資訊(co-authorship)轉換為圖式資料的方法,Henderson et al. [9]的重點則是在處理圖式資料中大量的較低連結的節點。
圖式資料的分群問題如下圖所表示,F與G分別是連結線開頭與結尾的節點所成的集合,E則是連結兩邊節點的連結線所成的集合,Z代表的是連結線的標示以及節點的分群。這個研究要探討的問題是如何標示F與G的節點之間的連結線?並且如何根據連結線的標示結果將G內的節點分群?
如果某一節點fi連結到另一個節點gj,是因為fi對gj中的某一個主題有共同的興趣。一個人可能會對多個主題感興趣,因此節點fi會連結到會來自不同主題的節點,就如同文件中會包含不同主題的詞語一樣,並且這些主題的分布為多元常態分布(multinormal distribution)。同樣的情形,每一個被連結到的節點可以出現在多個不同的主題上。另一方面,許多人會對某個特定主題有不同程度的興趣,因此每個主題內包含的節點也呈現多元常態分布,就像主題可以利用多種不同的詞語來表示,並且有些詞語會比其他的詞語相關。因此圖式資料的分群與標示問題可以將F與G內的節點分別類比成文件以及文件上出現的詞語,並且應用LDA模式的概念,利用得到的隱藏主題做為連結線的標示,並且對G內的節點的節點分群。
但是由於LDA本身是基於文件生成而發展的模型,在這個模型裡,出現在過多文件上的詞語通常比較無法代表特定的主題,被認為較不重要的詞語,因此往往會預先被過濾,不參與模型的訓練;然而在諸如社交網絡的圖式資料連結許多節點的節點則往往能夠代表多個較重要的主題,因此在將LDA模型應用於圖式資料的節點與連結線分群及標示應用時,應該納入這些連結較多的節點進行分析。本研究改良了LDA模型來解決上述的問題,使其可以運用於社交網絡平台上的朋友推薦(friend recommendations)功能。
本研究提出四種改良方案:
1) 以不對稱的先驗機率(a prior)取代對稱的先驗機率(a prior),使得連結較多的節點具有較高的先驗機率,例如使用下面的式子做為先驗機率的估計值。
2) 利用階層式LDA(hierarchical LDA)模型,使得較常出現的主題在上層,比較特定的主題則在下層。
3) 採用兩階段標示(two-step labeling)方式,第一階段先排除連結較多的節點以及連結到這些節點的連結線,只利用連結較少的節點資料建立主題,避免連結較多的節點的干擾,第二階段則利用第一階段建立好的主題,以下面的公式對連結較多的節點的連結線進行標示。
此處P(e(f, g) = z|. )為將從 f 到g的連結線標示為某一個主題z的機率,Ngz和Nfz分別表示g和f被標示為z的次數。
本研究從2009年十月到2010年一月間2億7千三百萬餘筆的Twitter資料中選取1千萬筆資料進行實驗,標示的主題個數為100,研究結果發現不管是複雜度(perplexity)或是分群的品質,後兩種方法(兩階段標示以及雜訊閾值過濾)都比其他的方法有較好的成效。
In this paper, we discuss how we can extend probabilistic topic models to analyze the relationship graph of popular social-network data, so that we can "group" or "label" the edges and nodes in the graph based on their topic similarity.
In particular, we first apply the well-known Latent Dirichlet Allocation (LDA) model and its existing variants to the graph-labeling task and argue that the existing models do not handle popular nodes (nodes with many incoming edges) in the graph very well. We then propose possible extensions to this model to deal with popular nodes.
Our proposed methods can be used for providing, for instance, more relevant friend recommendations within a social network.
In this paper, we explore techniques to provide more structure to this follow relationship (1) by "grouping" the users based on their topic interests, and (2) by "labeling" each follow relationship with the identified topic group.
Roughly, we can consider our goal as a clustering (or classification) problem, where many popular solutions such as K-means [15] and DBSCAN [6] exist. These existing methods, however, are not appropriate for our task because they either (1) associate each node with a single group (hard clustering) or (2) can associate each node with multiple groups (soft clustering), but require a completely separate method to label edges as well as nodes (since a node may be associated with multiple groups).
In this paper, we apply a well-known probabilistic topic model, called Latent Dirichlet Allocation (LDA), to the follow relationship graph of the social network, in order to label the nodes and the edges in the graph with (possibly) multiple topics. ...In particular, the direct application of LDA to our task requires that every node in the graph should be of roughly equal popularity and that we should remove nodes of high popularity from the dataset. This is particularly problematic because these popular nodes are really the ones that we want to label accurately; many users are particularly interested in identifying the topic groups of these popular users. Earlier work on the application of the LDA model to social graph [9, 26] has not addressed the handling of popular nodes.
Recently, Mislove et al. [18] identified communities in Facebook based on the social graph structure and inferred unknown attributes through this community information. The methods they used to cluster communities are based on pruning edges from the whole social graph and adding edges from some seed nodes, both of which are very common and widely used approaches in social-network analysis. However, these approaches produce mutually exclusive groups and cannot support multiple memberships, which is important in our scenario where users have a variety of interests.
Though not directly related to social-network analysis, the concept of author/user was initially introduced in the Author-Topic (AT) model [22]. It was used to extract hidden research topics and trends from CiteSeer's abstract corpus.
Zhou et al. [27] modified the AT model and proposed the Community User Topic (CUT) model to capture semantic communities.
McCallum et al. [16] extended the AT model and proposed the Author-Recipient-Topic (ART) model and the Role-Author-Recipient-Topic (RART) model in order to analyze the Enron e-mail corpus and an academic e-mail network.
Pathak et al. [19] modified the ART model and suggested the Community-Author-Recipient-Topic (CART) model similar to the RART model.
Besides these members of the AT model family, Wang et al. [24] introduced the Group Topic (GT) model and applied it to voting data from US Senate and the General Assembly of the UN.
Mei et al. [17] also introduced a regularized topic modeling framework incorporating a graph structure in the data.
Other LDA extensions and probabilistic topic models were also proposed for annotation data analysis [12], chat data analysis [23], tagging data analysis [8], and pairwise data analysis [2].
Perhaps the work by Zhang et al. [26] and by Henderson et al. [9] is closest to our work. In both works, the authors applied LDA to academic social networks. However, their respective focus was quite different from ours.
For example, [26] focused on the issue of how to convert the co-authorship information into a graph (e.g., direct co-authorship or indirect co-authorship, and edge weighting scheme based on collaboration frequency).
Henderson et al. [9] addressed the issue of a large number of topic clusters generated due to low popularity nodes in the network, while our primary focus is the effective clustering of high popularity nodes.
Because LDA evolved from Probabilistic Latent Semantic Indexing (PLSI) [11], we rst describe the concept of PLSI and then explain what di erentiates LDA from PLSI. ... PLSI introduced a probabilistic generative model to topic models. Equation (1) represents its document generation process based on the probabilistic generative model:
P(d, w) is the probability of observing a word w in a document d and can be decomposed into the multiplication of P(d), the probability distribution of documents, and P(w|d), the probability distribution of words given a document. This equation describes a word selection for a document, where we first select a document then a word in that document. If we iterate this selection multiple times, we can generate a document and eventually a whole document corpus.
By assuming that there is a latent topic z, we can rewrite the equation above with the multiplication of P(w|z), the probability distribution of words given a topic, and P(z|d), the probability distribution of topics given a document. This equation describes adding an additional topic selection step between the document selection step and the word selection step. As there are multiple latent topics where a word may come from, we sum the multiplication over a set of all the independent topics Z.
Though PLSI is equipped with a sound probabilistic generative model and a statistical inference method, it suffers from the overfitting problem and does not cope well with unobserved words.
To solve this problem, Blei et al. [4] introduced Dirichlet priors α and β to PLSI, to constrain P(z|d) and P(w|z), respectively.
α is a vector of dimension |Z|, the number of topics, and each element in α is a prior for a corresponding element in P(z|d). Thus, a higher αi implies that the topic zi appears more frequently than other topics in a corpus.
Similarly, β is a vector of dimension |W|, the number of words, and each element in β is a prior for a corresponding element in P(w|z). Thus, a higher βj implies that the word wj appears more frequently than other words in the corpus.
As a conjugate prior for the multinomial distribution, the Dirichlet distribution can also simplify the statistical inference. By placing Dirichlet priors α and β on the multinomial distributions P(z|d) and P(w|z), those multinomial distributions are smoothed by the amount of α and β and become safe from the overfitting problem of PLSI.
Thus, we use the term follow when a user adds another user as her friend. Formally, when a user f follows a user g, f generates a follow edge, or simply an edge, e(f; g) from a follower f to a followed user g. We also use e'(f; g) to denote an edge from g to f (indicating that g is followed by f), e(f) to denote the set of all outgoing edges from f, and e'(g) to denote the set of all incoming edges to g. To
refer to the set of all followers, the set of all followed users, and the set of all edges in the dataset we use F, G, and E, respectively.
Figure 1(a) depicts this notation using a graph that we refer to as a subscription graph. ... Given this subscription graph, our goal is to label each edge with a correct label (interest) and group (label) each followed user based on those labeled edges. Now we can frame our problem as the graph labeling problem of automatically associating each user gi in G with a set of accurate interests zk in Z based on its labeled incoming edges e'(gi). (We also label fj in F as well.)
Furthermore, we can consider a follower f to be a document d, a followed user g as a word w, and a list of followed users for the follower as the content of the document.
However, there is a subtle difference between the two generative processes. While words are sampled with replacement in the standard LDA, followed users are sampled without replacement in our model. For example, in a document generative model, a document may contain the word car multiple times. On the other hand, in our edge generative model, a user cannot follow the same user Barack Obama multiple times. As a result, the probabilistic distribution in our model does not follow a multinomial distribution but follows a multivariate hypergeometric distribution. Fortunately, the multinomial distribution can also be used for our model because it is known that a multivariate hypergeometric distribution converges to a multinomial distribution as the sample size grows large [1]. In our case, since sampling is done on millions of nodes, the two distributions become practically indistinguishable.
Also, when we represent E in matrix form by putting F in the rows and G in the columns as E (belongs) power(B, FXG), where B = {0,1}, some differing aspects are noticed:
As in a matrix formed from a document corpus, the distribution of the column sums show a power-law distribution, in which some small portion of the columns accounts for a majority of the total column document corpus, these columns correspond to words such as the and is, which we call stop words, and are generally removed. Those columns in our matrix, however, correspond to users such as Barack Obama and Britney Spears, which we can call popular users, and should be taken into account.
In a document corpus analysis, the stop words are generally removed before analysis, since an analysis without removing these stop words produces a very noisy result where the frequent stop words are labeled with every topic [4]. However, in our analysis, popular users are very important to include in the analysis because most users are keenly interested in following famous and influential users whose topic interests are similar to theirs. Unfortunately, it is not sufficient to simply include the popular users for LDA analysis, because the inclusion of popular users produces the same noisy result seen in the text analysis case: when stop words (or popular users) are included, they get included in every topic group, producing a very noisy result.
Though each element of vectors α and β may take different values in principle, in the standard LDA, each element of α and β is assumed to have the same value (often referred to as the symmetric prior assumption). Intuitively, this assumption implies that every topic and word in a document corpus is equally likely. ... As a higher prior value implies a higher likelihood of being observed in the corpus, we set each prior value proportional to the number of incoming edges of each followed user. It is expected to associate popular users with more accurate labels as they are given adequate prior values.
We set βgi , the prior for the followed user gi in the vector , as in Equation (3):
Hierarchical LDA (HLDA) is also a good candidate for our problem because it generates a topic hierarchy and more frequent topics are located at higher levels.
In HLDA, when a document is generated according to Equation (1), words are chosen from topics in a document path. Since the top level topic is associated with all the documents, common words in every document (i.e., stop words) are expected to be labeled with the top level topic. On the contrary, the bottom level topics are expected to be more specific as they are associated with a small number of documents.
This topic hierarchy is established because HLDA is based on the Nested Chinese Restaurant Process (NCRP), a tree extension to Chinese Restaurant Process, which probabilistically generates a partition of a set {1, 2, ......, n} at time n. In NCRP, a document is considered as a Chinese restaurant traveler who visits L restaurants along a restaurant tree path, where L refers to the level of the tree (i.e., the length of the path).
We decompose the labeling process into two sub-processes of establishing topics and labeling users with the established topics. In the first topic establishment step, we run LDA after removing popular users from the dataset similar to how we remove stop words before applying LDA to a document corpus. This step generates clean topics free from the noise generated by popular users. In the second labeling step, we apply LDA only to popular users in the dataset. As we use the collapsed Gibbs sampling algorithm [21], edges to popular users are labeled according to the pre-established topics as represented in Equation (4).
where P(e(f, g) = z|. ) denotes the probability of labeling the edge from a follower f to the followed user g with a topic z, given all conditions, Ngz denotes the number of times g is labeled with z, and Nfz denotes the number of times f is labeled with z.
As briefly described in Section 3.1, the association between a user (a word) and a topic has varying association strengths represented by P(g|z) (P(w|z)) in a probabilistic topic model. Thus, we can list users labeled with the same topic in descending order of P(g|z) and regard the top entries in the list (topic group) as more important than the entries at the bottom because they are more strongly associated with the topic. Similarly, we can measure the association strength from the user's viewpoint using P(z|g). Though two-step labeling may help label popular users with the right topics, popular users may take top positions even in less-relevant topic groups because even the smallest number of times a popular user is assigned to a topic group could outnumber the largest number of times a non-popular user is assigned to that topic group.
To mitigate this problem, we propose a new approach, threshold noise filtering, which sets a cut-off value to determine whether to label a user with each topic. By ignoring assignments below the cut-off value, we can expect smoothing and noise reduction effects as in the anti-aliasing filter.
We set Ngizk, the number of times a followed user gi is assigned to a topic group zk, as 0 if it is below the cut-off value C:
As threshold noise filtering process is done after sampling, it does not increase computational complexity similar to two-step labeling.
Alternatively, we may filter out less relevant topics by keeping only the top-K topics for each user, for a reasonably small K value. We tested both schemes (threshold noise filtering and top-K filtering), but we couldn't see any practical differences.
Though threshold noise filtering can be used with any other approaches, we combine it with the two most representative cases, two-step labeling and the standard LDA in our experiments.
For our experiments, we use a Twitter dataset we collected between October 2009 and January 2010. The original downloaded dataset contained 273 million follow edges, but we sampled 10 million edges from this dataset to keep our experiment manageable. To ensure that all the follow edges were preserved in our sampled dataset, we first sampled followers randomly and included all follow edges from the sampled users, until we obtained 10 million follow edges.
From the graph, it is clear that the number of incoming edges follows a power-law distribution, which is often the case for this type of dataset. Interestingly, we observe that the number of outgoing edges is quite uniform between edge counts 1 to 100, which is different from the distribution reported in [14]. We do not believe this difference is due to our sampling, because the graph from the complete dataset shows the same at curve between the edge counts 1 and 100 [25].
Perplexity is a well-known standard metric used in IR. It tries to quantify the accuracy of a model by measuring how well the trained model deals with an unobserved test data. More precisely, perplexity is defined to be [26]:
where Etest denotes all the edges in the test dataset. We calculated Perplexity values for the 10% held-out dataset after training the models on the remaining 90% dataset. In general, lower perplexity means better generalizability.
Note that the original LDA model is designed to derive an optimal model that minimizes perplexity. Therefore, it is unlikely that extensions to base LDA show lower perplexity. Our primariy goal of comparing the results under the perplexity metric is to see whether our proposed extensions significantly degrade perplexity.
From the graphs, we first notice that HLDA (hlda-3lv and hlda-2lv) shows significantly worse Perplexity than others. This is due to the fact that the standard LDA is designed to minimize Perplexity while HLDA is not necessarily designed for this task.
Overall, we find that our two-step labeling and threshold noise filtering show similar perplexity values to the standard LDA model. That is, our extensions to LDA do not introduce noticeable Perplexity degradation to the standard LDA model.
To measure human-perceived quality, we conducted a survey with a total of 14 participants. The participants in our survey were presented with a random group of ten Twitter users (identified by one of the eight approaches described in Section 5.1) and were asked to indicate if each user in the topic group was relevant to the group or not. Overall, we collected 23 judged topic groups per each approach with a total of 161 judged topic groups.
Both of our extensions, two-step labeling and threshold noise filtering are very effective in improving the human-perceived quality of identified topic groups.
As two-step filtering initially forms clean topics free from popular user generated noise, its gain is more significant than that of threshold noise filtering. However, threshold noise filtering can be easily combined with any other approaches to improve Quality, as in filter-base.
If we apply the standard LDA to popular users as well without any pre-filtering, the produced topic groups are quite noisy.
Using an asymmetric Dirichlet prior for the hyperparameter β improves Quality, but the result is still noisier compared to our extensions.
The HLDA model shows only marginal improvements in the Quality metric when compared to base. This low performance is because most of the low-level groups in HLDA have users with very few followers (who are less likely to be interesting when recommended) and contribute only small weights to Quality.
Note that many users in this group are very popular and they are mainly about the same topic, tech media, indicating that 2step is able to group popular users into the right topic group in general. However, we observe that 2step still su ers from the presence of a few popular, yet less relevant users in the group (in this example, cnnbrk and breakingnews may be considered less relevant to the group than others)
With threshold noise filtering, we achieved less noisy results. Figure 12 shows a result topic group from filter-2step corresponding to the group in Figure 11 from 2step. We observe that cnnbrk and breakingnews, popular users on general media, are now removed from Figure 11 and more technology-centric media such as firefox, youtube, and engadget are added.
Even with this relatively limited type of data compared to those of previous approaches, our approaches generated very well-organized results and showed the great potential of applying LDA to these kinds of clustering applications. Our approaches are especially useful when only linkage data is available.
沒有留言:
張貼留言