2013年11月21日 星期四

Henderson, K., & Eliassi-Rad, T. (2009, March). Applying latent Dirichlet allocation to group discovery in large graphs. In Proceedings of the 2009 ACM symposium on Applied Computing (pp. 1456-1461). ACM.

Henderson, K., & Eliassi-Rad, T. (2009, March). Applying latent Dirichlet allocation to group discovery in large graphs. In Proceedings of the 2009 ACM symposium on Applied Computing (pp. 1456-1461). ACM.

LDA-G(Latent Dirichlet Allocation for Graphs的簡稱)利用主題模型演算法(topic modeling algorithm)在實際的圖式資料(graph data,也就是網絡資料)發現隱藏群組結構(latent group structure)。

LDA主要應用於文本語料,這個方法假設在一個生成模型(generative model)中,語料庫內的每一文件由對於分布形式為多元常態分布(multinormal distribution)的多個主題重複取樣而得到,並且文件中每個詞語對應的主題在生成這個詞語時也是根據一個多元常態分布。整個生成模型可以簡化如下:

在推估後驗機率(posterior probability)時,可以如[4]所建議的方式,使用Gibbs sampling。

在應用LDA於圖式資料,假設節點間為有方向性的連結線(無方向性的連結線可以視為兩條有方向性的連結線),以連結出去的起始節點做為主題模型演算法中所謂的文件,被連結的結束節點則做為詞語。
由於文本語料和圖式資料有一些不同,因此需要對LDA進行修改。本研究列出文本語料和圖式資料的不同點,並且探討這些不同點對於在圖式資料中應用LDA的影響或是需要修改LDA的地方:
1. 文本語料中每個文件包含的詞語種類數都很多;然而根據冪次法則(power law),一般的圖式資料上大多數的節點擁有連出節點的數目都很少。這種情形在使用Gibbs sampling時將會產生為數相當多的群體,對於實際的圖式資料,這些多餘的群體並沒有顯著的有意義(significant meaning),因此在對圖式資料的型態(topology)沒有提供進一步的預測能力。
2. 許多詞語在文件中會重複出現,LDA也利用了這個特性;然而在一般的社交網絡上大多假設連結線為單一值。在實驗的情況中,連結線是否為多重值並沒有產生任何重大的影響,因此本研究建議當連結線為多重值時可以將它們轉換成單一值。
3. 詞語的順序與語意有某種程度的關連,在圖式資料上每個節點所連結的節點則無順序可言。然而LDA並沒有利用詞語的順序資訊,所以並沒有影響。

在運用LDA-G對圖式資料發現群組時可以使用有限的參數模型(a finite parametric model)或是無限的無參數模型(infinite nonparametric model),兩種模型的差異在於群組(也就是LDA的主題)數量的決定方式。有限的LDA-G模型由某些條件(例如連結的預測能力大小)來決定最佳的群組數量。無限的LDA-G模型則是將群組數量也視為要由Gibbs sampling演算法學習的變數之一。本研究使用的方式為有限模型。

本研究使用的研究資料是PubMed資料庫,針對三種圖式資料:1) Author x Knowledge: 是指從作者連結到知識主題(knowledge theme)的圖式資料,如果在資料庫中至少有一份文件,某一位作者u是它的作者,而且它的摘要內有包含知識主題k,則從u到k有一條連結線。在這個圖式資料上共有37,346位作者以及117個知識主題,並且共有119,443條連結線。 2) Author x Author: 是合著網絡(coauthorship network),如果資料庫裡有兩位作者合著的論文,他們的節點間便會有連結線。這個圖式資料有37,227位作者,並且共有143,364條連結線。3) Knowledge-Infused Author x Author: 這個圖式資料是在原先的Author x Author圖式資料加入有共同知識主題的作者之間的連結線,也就是如果兩位作者不曾合著過,但他們都曾撰寫某一個知識主題相關的文件達到十二次以上,他們之間便增加一條連結線。這個圖式資料擁有37,227個作者,但連結線增加339,644條。研究針對三種圖式資料測試LDA-G以及IRM (Infinite Relational Model)
和 Cross-association等群組發現方法,測試它們的連結線預測能力,以ROC曲線(ROC curve)下的面積做為比較的標準。結果發現平均而言,LDA-G的連結線預測能力比IRM好15%,並且比Cross-association好25%。

This paper introduces LDA-G, a scalable Bayesian approach to finding latent group structures in large real-world graph data. ... LDA-G (short for Latent Dirichlet Allocation for Graphs) utilizes a well-known topic modeling algorithm to find latent group structure. Specifically, we modify Latent Dirichlet Allocation (LDA) to operate on graph data instead of text corpora. Our modifications reflect the differences between real-world graph data and text corpora (e.g., a node’s neighbor count vs. a document’s word count).

In our empirical study, we apply LDA-G to several large graphs (with thousands of nodes) from PubMed (a scientific publication repository). We compare LDA-G’s quantitative performance on link prediction with two existing approaches: one Bayesian (namely, Infinite Relational Model ) and one non-Bayesian (namely, Cross-association). On average, LDA-G outperforms IRM by 15% and Cross-association by 25% (in terms of area under the ROC curve).

A canonical form of topic modeling is the latent Dirichlet allocation (LDA) [2]. LDA is a hierarchical nonparametric Bayesian approach to topic discovery in text corpora. It assumes a generative model in which each document di from a corpus samples (latent) topics from a multinomial distribution Mult(θ), and each topic zi samples words in the vocabulary from a multinomial distribution p(w|zi).

Gibbs sampling, such as in [4], can be used to estimate the posterior probability on configurations of the model.

A given configuration is an assignment z =< z1z2, ..., zn >, where each entry corresponds to the topic of a given word in the corpus.
The full generative model for a simplified version of LDA is as follows:

The runtime complexity of this inference approach is O(NKM).
For LDA, N is the number of documents, K is the number of topics, and M is the average length of documents.
For LDA-G, N is the number of nodes, K is the number of groups, and M is the average node degree in the graph.
The space complexity for both is O(N(K +M)).

There are three primary differences between real-world graphs (e.g., social networks, citation networks, etc) and text corpora.

The first difference is the distribution of node degrees vs. distribution of words. Text corpora usually have at least tens or hundreds of words in even the shortest documents. Social networks tend to have power-law distributions [9], with the modal degree being 1 and the mean degree being very low.
The degree distribution, however, does have an effect on performance. In particular, when there are hundreds of thousands of low-degree nodes in a network, the Gibbs sampler for LDA tends to generate thousands or tens of thousands of groups. The reason for this can be readily seen by looking at the sampling equations in [4]. This is a problem for group discovery, since these extra “topics” or groups have no significant meaning for real-world graphs, and do not provide additional predictive capacity on the topology of the graph.

The second major difference is that many real-world graphs, especially social networks, are not represented as multigraphs. Clearly, most documents of interest have some words that appear multiple times; and LDA takes advantage of this when modeling the topics.
The multigraph aspect is a valid consideration; but in our experiments (see Section 5), it does not have any major effect on performance. In fact, when we analyzed a real-world multigraph using LDA-G, it generated a more predictive model when all multiple edges were transformed into single edges. Thus, the multigraph discrepancy can be safely ignored.

Finally, text documents are semantically sensitive to word ordering, whereas in social networks, no order is assigned to the edges of a given vertex.
Word order is not part of the LDA model, so it can be disregarded in our adaptation.

To perform group discovery on graphs, LDA-G can use either a finite parametric model or an infinite nonparametric model.

In the finite case, the model is truncated at some number of groups, K, which is chosen on a per-application basis. There are several considerations which can affect the proper choice of K. For our purposes, we use link prediction on a tuning set as the main criterion.

The infinite LDA-G model learns the appropriate model size through yet another hyperparameter and alterations to the Gibbs sampler update.

In our experiments (see Section 5), we find finite LDA-G with model selection to be superior to the full nonparametric version.

Author x Knowledge Graph: This is a bipartite graph with author nodes connected to knowledge theme nodes. A link exists from an author u to a knowledge theme k for every paper in which u is a coauthor and k is a theme appearing in the abstract. This graph is a multigraph, but as noted above we summarize any multiple edges as single edges. This graph contains 37,346 authors and 117 knowledge themes, with 119,443 author-knowledge edges (see Figure 2).

Author x Author Graph: This graph is a coauthorship network, two author nodes are connected if they published a paper together. This symmetric graph has a clique for each publication. It contains 37,227 authors and 143,364 edges (see Figure 2). Note that there are some authors who appear in this graph but not in the Author x Knowledge Graph and vice versa.

Knowledge-Infused Author x Author Graph: This graph is constructed by infusing the Author x Author Graph with information from the Author x Knowledge Graph. We generate this graph as follows. First, we prune the Author x Knowledge multigraph by removing links that appear less than 12 times. We pick the threshold of 12 based on graphsize considerations. This step effectively removes noise in the Author x Knowledge Graph. Second, in the Author x Author Graph, we add a link between any pair of authors that share a knowledge theme in the pruned Author x Knowledge graph. This produces a denser “knowledge-infused” Author x Author Graph with 37,227 authors and 339,644 edges (see Figure 2).

In particular, we use stratified random sampling to hold-out some links from the input graph. The remaining links are used to discover the latent groups. Then, the superiority of these groups is checked based on how well they predict the existence of the held-out links.

For LDA-G, it is the average over all sampled configurations of Σt∈topics p(t|vi)p(vj |t).

In all experiments, we construct the ROC curve on a test set of unobserved links (randomly selected from the graph data) and report the average Area Under the ROC Curve (AUC) over 5 independent trials.
This paper describes LDA-G, a scalable and expressive approach to this problem. LDA-G modifies a popular topic modeling algorithm for graph data. Comparative experiments with IRM (an alternative Bayesian approach) and Cross-association (a compression-based approach) illustrate the superiority of LDA-G in terms of both quantitative and qualitative results.

沒有留言:

張貼留言