Friday, December 6, 2013

Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004, July). The author-topic model for authors and documents. In Proceedings of the 20th conference on Uncertainty in artificial intelligence (pp. 487-494). AUAI Press.

This study proposes the author-topic model, a generative model for documents that extends LDA (Latent Dirichlet Allocation; Blei, Ng, & Jordan, 2003) with authorship information. Like basic LDA, it represents each document as a mixture of topics, but it has the distinctive property that the mixture weights over the topics expressed in a document are determined by the document's authors. To explain the model, the paper compares the generative processes of basic LDA, the author model, and the author-topic model. The generative process of LDA consists of three steps: first, a topic distribution for each document is sampled from a Dirichlet distribution; second, before each word of the document is generated, a topic is chosen from that topic distribution; finally, the word is sampled from the word distribution associated with the chosen topic. The complete model is shown in the figure below (Figure 1(a) in the paper).

In this model, ϕ denotes the matrix of topic distributions: each of the T topics is represented by a multinomial distribution over the V words of the vocabulary, and these multinomial distributions are drawn independently from a Dirichlet(β) distribution. θ is the matrix of mixture weights over the T topics for the documents, and each document's topic mixture weights are drawn independently from a Dirichlet(α) distribution. Finally, each word in a document has a corresponding topic z, sampled from that document's topic mixture weights θ, and the word itself is generated from the topic distribution ϕ corresponding to z. By applying an inference algorithm to this generative model, estimates of ϕ and θ provide information about the topics present in the corpus and the weights of those topics in each document, respectively. Commonly used algorithms include variational inference (Blei et al., 2003), expectation propagation (Minka & Lafferty, 2002), and Gibbs sampling (Griffiths & Steyvers, 2004).

The author model is shown in the figure below.

This model assumes that a document is written jointly by a group of authors a_d. Each author has his or her own habitual set of words, described by a probability distribution over words ϕ drawn independently from a Dirichlet(β) distribution. When a document is generated, each word in it is attributed to an author x chosen uniformly at random from a_d, and the probability of generating the word is determined by the word distribution associated with author x. Estimating ϕ therefore provides information about authors' research interests, and the similarity of the documents authors have written can in turn be used to estimate how similar their research topics are. The weakness of this model is that the estimate of an author's interests is limited to the words in the documents that author has written; it cannot generalize to documents on the same topics that use different words.

The author-topic model combines the two models above. Like the author model, it assumes that each word in a document is attributed to an author x chosen uniformly at random from the set of co-authors a_d, but the author-topic model further assumes that each author has his or her own mixture of topics θ, whose distribution is drawn independently from Dirichlet(α). Before each word is generated, a topic z is first chosen according to the topic distribution θ of the selected author x, and a word w is then chosen as output according to the word distribution ϕ associated with that topic, where ϕ is drawn independently from Dirichlet(β). The author-topic model is shown in the figure below.

This study uses Gibbs sampling to estimate the parameters of the various models. Gibbs sampling provides a simple method for obtaining parameter estimates under Dirichlet priors and allows estimates to be combined from several local maxima of the posterior distribution.

For the LDA model, there are two sets of unknown parameters, θ and ϕ, as well as latent variables, namely the topic z assigned to each word. By applying Gibbs sampling over z (Gilks, Richardson, & Spiegelhalter, 1996), a Markov chain can be constructed that converges to the posterior distribution, and the results can then be used to infer θ and ϕ (Griffiths & Steyvers, 2004). The transition between successive states of the Markov chain results from repeatedly drawing z from its distribution conditioned on all other variables, as in the following equation.
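Written with the count notation defined in the next paragraph, this conditional distribution (following Griffiths & Steyvers, 2004) is

P(z_i = j \mid w_i = m, \mathbf{z}_{-i}, \mathbf{w}_{-i}) \propto \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta} \cdot \frac{C^{DT}_{dj} + \alpha}{\sum_{j'} C^{DT}_{dj'} + T\alpha}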

Here z_i = j denotes the assignment of the ith word of a document to topic j; w_i = m denotes the observation that the ith word is word m of the vocabulary; and z_{-i} denotes the topic assignments of all words other than the ith word. C^{WT}_{mj} is the number of times word m is assigned to topic j, excluding the current instance, and C^{DT}_{dj} is the number of times topic j occurs in document d, excluding the current instance. The equation above is used to run the Markov chain, topic assignments for all words in the training documents are sampled from it, and θ and ϕ are then estimated.
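For any such sample, the two sets of parameters can be estimated from the same counts, for example as

\phi_{mj} = \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}, \qquad \theta_{dj} = \frac{C^{DT}_{dj} + \alpha}{\sum_{j'} C^{DT}_{dj'} + T\alpha}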

Here ϕ_{mj} is the probability of using word m in topic j, and θ_{dj} is the probability of topic j in document d.

The same approach can be used to estimate the unknown parameter ϕ of the author model, whose latent variables are the authors x assigned to the words of each document. The transition between successive states of the Markov chain results from repeatedly drawing x from its distribution conditioned on all other variables.
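In terms of the author-word counts defined in the next paragraph, this conditional takes the form (the uniform choice of author from a_d contributes only a constant factor)

P(x_i = k \mid w_i = m, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d) \propto \frac{C^{WA}_{mk} + \beta}{\sum_{m'} C^{WA}_{m'k} + V\beta}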

Here x_i = k denotes the assignment of the ith word of a document to author k, and C^{WA}_{mk} is the number of times word m is assigned to author k, excluding the current instance. The probability that author k uses word m can then be estimated with the following equation.
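Analogously to the LDA estimates above,

\phi_{mk} = \frac{C^{WA}_{mk} + \beta}{\sum_{m'} C^{WA}_{m'k} + V\beta}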



For the author-topic model, there are two sets of latent variables, z and x. Conditioned on all the other observed and unobserved variables, the probability of assigning word w_i to author x_i = k and topic z_i = j is given below.
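In the count notation defined in the next paragraph, this joint conditional (Equation 4 of the paper) can be written as

P(z_i = j, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d) \propto \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta} \cdot \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}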

Here z_{-i} and x_{-i} denote the topic and author assignments of all words other than the ith word, and C^{AT}_{kj} is the number of times author k is assigned to topic j, excluding the current instance. From samples drawn with the equation above, the probability ϕ_{mj} of selecting word m given topic j and the probability θ_{kj} of selecting topic j given author k can be estimated with the following equations.
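In terms of the same counts,

\phi_{mj} = \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}, \qquad \theta_{kj} = \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}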



The experiments use two data sets, NIPS and CiteSeer. For each topic discovered by the author-topic model, the highest-probability words in the corresponding topic distribution are used to label the topic, and the authors with the highest probability for that topic are chosen as its representative authors. Most of the labeling words turn out to represent their topics concretely, and the representative authors are mostly well-known researchers in those topics. The NIPS data set contains 1,740 conference papers by 2,037 authors, with a vocabulary of 13,649 unique words and a total of 2,301,375 word tokens; the CiteSeer data set contains 162,489 abstracts by 85,465 authors, with a vocabulary of 30,799 unique words and a total of 11,685,514 word tokens. In addition, the study uses perplexity to compare the word-prediction performance of the three models. Perplexity is computed as follows.
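For a single held-out document d with words w_d, author set a_d, and N_d word tokens, one consistent per-document form is

\mathrm{Perplexity}(\mathbf{w}_d \mid \mathbf{a}_d) = \exp\!\left(-\frac{\ln p(\mathbf{w}_d \mid \mathbf{a}_d)}{N_d}\right)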

Lower perplexity indicates better word-prediction performance. The author model, limited by its weak generalization ability, performs worse than the two topic-based models. Because it incorporates information from the authors, the author-topic model outperforms LDA when only a small amount of a document's training words has been observed; as the observed training words increase, LDA achieves lower perplexity, and the more topics the model has, the fewer observed words LDA needs to reach this crossover. Finally, the study measures the similarity of authors' research topics with the symmetric KL divergence between their topic distributions, defined as follows.
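With θ_i and θ_j denoting the topic distributions of authors i and j over the T topics,

\mathrm{sKL}(i, j) = \sum_{t=1}^{T}\left[\theta_{it}\log\frac{\theta_{it}}{\theta_{jt}} + \theta_{jt}\log\frac{\theta_{jt}}{\theta_{it}}\right]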

It also measures the breadth of each author's research topics with the entropy of the author's topic distribution.

We introduce the author-topic model, a generative model for documents that extends Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003) to include authorship information. Each author is associated with a multinomial distribution over topics and each topic is associated with a multinomial distribution over words. A document with multiple authors is modeled as a distribution over topics that is a mixture of the distributions associated with the authors.

We apply the model to a collection of 1,700 NIPS conference papers and 160,000 CiteSeer abstracts.

Recently, generative models for documents have begun to explore topic-based content representations, modeling each document as a mixture of probabilistic topics (e.g., Blei, Ng, & Jordan, 2003; Hofmann, 1999).

With an appropriate author model, we can establish which subjects an author writes about, which authors are likely to have written documents similar to an observed document, and which authors produce similar work.

This generative model represents each document with a mixture of topics, as in state-of-the-art approaches like Latent Dirichlet Allocation (Blei et al., 2003), and extends these approaches to author modeling by allowing the mixture weights for different topics to be determined by the authors of the document.

In LDA, the generation of a document collection is modeled as a three step process. First, for each document, a distribution over topics is sampled from a Dirichlet distribution. Second, for each word in the document, a single topic is chosen according to this distribution. Finally, each word is sampled from a multinomial distribution over words specific to the sampled topic.

This generative process corresponds to the hierarchical Bayesian model shown (using plate notation) in Figure 1(a).



In this model, ϕ denotes the matrix of topic distributions, with a multinomial distribution over V vocabulary items for each of T topics being drawn independently from a symmetric Dirichlet(β) prior. θ is the matrix of document-specific mixture weights for these T topics, each being drawn independently from a symmetric Dirichlet(α) prior. For each word, z denotes the topic responsible for generating that word, drawn from the θ distribution for that document, and w is the word itself, drawn from the topic distribution ϕ corresponding to z.

Estimating ϕ and θ provides information about the topics that participate in a corpus and the weights of those topics in each document respectively.

A variety of algorithms have been used to estimate these parameters, including variational inference (Blei et al., 2003), expectation propagation (Minka & Lafferty, 2002), and Gibbs sampling (Griffiths & Steyvers, 2004).

Assume that a group of authors, a_d, decide to write the document d. For each word in the document an author is chosen uniformly at random, and a word is chosen from a probability distribution over words that is specific to that author.

x indicates the author of a given word, chosen uniformly from the set of authors a_d. Each author is associated with a probability distribution over words ϕ, generated from a symmetric Dirichlet(β) prior. Estimating ϕ provides information about the interests of authors, and can be used to answer queries about author similarity and authors who write on subjects similar to an observed document.

However, this author model does not provide any information about document content that goes beyond the words that appear in the document and the authors of the document.

As in the author model, x indicates the author responsible for a given word, chosen from a_d. Each author is associated with a distribution over topics, θ, chosen from a symmetric Dirichlet(α) prior. The mixture weights corresponding to the chosen author are used to select a topic z, and a word is generated according to the distribution ϕ corresponding to that topic, drawn from a symmetric Dirichlet(β).
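To make this generative process concrete, here is a minimal numpy sketch of it; the corpus sizes, the function name, and the use of symmetric Dirichlet priors with the fixed hyperparameters mentioned later are assumptions for illustration, not the authors' code.

import numpy as np

# Illustrative sketch of the author-topic generative process.
rng = np.random.default_rng(0)
T, V, A = 10, 1000, 50                            # topics, vocabulary size, authors (assumed)
alpha, beta = 50.0 / T, 0.01                      # symmetric Dirichlet hyperparameters

phi = rng.dirichlet(np.full(V, beta), size=T)     # topic -> word distributions
theta = rng.dirichlet(np.full(T, alpha), size=A)  # author -> topic distributions

def generate_document(authors, n_words):
    """Generate word indices for a document written by the given set of authors."""
    words = []
    for _ in range(n_words):
        x = rng.choice(authors)          # pick one of the co-authors uniformly at random
        z = rng.choice(T, p=theta[x])    # pick a topic from that author's topic distribution
        w = rng.choice(V, p=phi[z])      # pick a word from that topic's word distribution
        words.append(int(w))
    return words

doc = generate_document(authors=[3, 17], n_words=200)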

In this paper, we will use Gibbs sampling, as it provides a simple method for obtaining parameter estimates under Dirichlet priors and allows combination of estimates from several local maxima of the posterior distribution.

The LDA model has two sets of unknown parameters -- the D document distributions θ, and the T topic distributions ϕ -- as well as the latent variables corresponding to the assignments of individual words to topics z. By applying Gibbs sampling (see Gilks, Richardson, & Spiegelhalter, 1996), we construct a Markov chain that converges to the posterior distribution on z and then use the results to infer θ and ϕ (Griffiths & Steyvers, 2004). The transition between successive states of the Markov chain results from repeatedly drawing z from its distribution conditioned on all other variables, summing out θ and ϕ using standard Dirichlet integrals:
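In the count notation defined below, a consistent form of this update is

P(z_i = j \mid w_i = m, \mathbf{z}_{-i}, \mathbf{w}_{-i}) \propto \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta} \cdot \frac{C^{DT}_{dj} + \alpha}{\sum_{j'} C^{DT}_{dj'} + T\alpha}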

where z_i = j represents the assignments of the ith word in a document to topic j, w_i = m represents the observation that the ith word is the mth word in the lexicon, and z_{-i} represents all topic assignments not including the ith word. Furthermore, C^{WT}_{mj} is the number of times word m is assigned to topic j, not including the current instance, and C^{DT}_{dj} is the number of times topic j has occurred in document d, not including the current instance.

For any sample from this Markov chain, being an assignment of every word to a topic, we can estimate ϕ and θ using
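A consistent form, using the counts defined above, is

\phi_{mj} = \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta}, \qquad \theta_{dj} = \frac{C^{DT}_{dj} + \alpha}{\sum_{j'} C^{DT}_{dj'} + T\alpha}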


where ϕ_{mj} is the probability of using word m in topic j, and θ_{dj} is the probability of topic j in document d. These values correspond to the predictive distributions over new words w and new topics z conditioned on w and z.

An analogous approach can be used to derive a Gibbs sampler for the author model.
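In terms of the author-word counts defined below, the corresponding conditional is

P(x_i = k \mid w_i = m, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d) \propto \frac{C^{WA}_{mk} + \beta}{\sum_{m'} C^{WA}_{m'k} + V\beta}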

where x_i = k represents the assignments of the ith word in a document to author k and C^{WA}_{mk} is the number of times word m is assigned to author k.



In the author-topic model, we have two sets of latent variables: z and x. We draw each (z_i, x_i) pair as a block, conditioned on all other variables:
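A form consistent with the definitions below (Equation 4 in the paper) is

P(z_i = j, x_i = k \mid w_i = m, \mathbf{z}_{-i}, \mathbf{x}_{-i}, \mathbf{w}_{-i}, \mathbf{a}_d) \propto \frac{C^{WT}_{mj} + \beta}{\sum_{m'} C^{WT}_{m'j} + V\beta} \cdot \frac{C^{AT}_{kj} + \alpha}{\sum_{j'} C^{AT}_{kj'} + T\alpha}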

where z_i = j and x_i = k represent the assignments of the ith word in a document to topic j and author k respectively, w_i = m represents the observation that the ith word is the mth word in the lexicon, z_{-i} and x_{-i} represent all topic and author assignments not including the ith word, and C^{AT}_{kj} is the number of times author k is assigned to topic j, not including the current instance.

Equation 4 is the conditional probability derived by marginalizing out the random variables ϕ (the probability of a word given a topic) and θ (the probability of a topic given an author).




In the examples considered here, we do not estimate the hyperparameters α and β; instead, the smoothing parameters are fixed at 50/T and 0.01 respectively.

We start the algorithm by assigning words to random topics and authors (from the set of authors on the document). Each iteration of the algorithm involves applying Equation 4 to every word token in the document collection, which leads to a time complexity that is of order of the total number of word tokens in the training data set multiplied by the number of topics, T (assuming that the number of authors on each document has negligible contribution to the complexity).
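A minimal numpy sketch of one such sweep is given below, assuming word-topic counts CWT (V x T), author-topic counts CAT (A x T), and per-document assignment lists z and x that have already been initialized at random; the variable names and data layout are assumptions for illustration, not the authors' implementation.

import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(docs, doc_authors, z, x, CWT, CAT, alpha, beta):
    """One sweep of the blocked (topic, author) update over every word token."""
    V, T = CWT.shape
    for d, words in enumerate(docs):
        authors = doc_authors[d]
        for i, w in enumerate(words):
            # Remove the current assignment of this token from the counts.
            CWT[w, z[d][i]] -= 1
            CAT[x[d][i], z[d][i]] -= 1
            # Joint conditional over (author, topic) pairs, up to a constant;
            # sums are recomputed here for clarity, though a real implementation would cache them.
            p_wt = (CWT[w, :] + beta) / (CWT.sum(axis=0) + V * beta)              # shape (T,)
            p_at = (CAT[authors, :] + alpha) / \
                   (CAT[authors, :].sum(axis=1, keepdims=True) + T * alpha)       # shape (|a_d|, T)
            probs = (p_at * p_wt).ravel()
            probs /= probs.sum()
            idx = rng.choice(len(probs), p=probs)
            a_new, t_new = authors[idx // T], int(idx % T)
            # Record the new assignment and restore the counts.
            x[d][i], z[d][i] = a_new, t_new
            CWT[w, t_new] += 1
            CAT[a_new, t_new] += 1
    return z, x, CWT, CAT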

In our results we used two text data sets consisting of technical papers: full papers from the NIPS conference and abstracts from CiteSeer (Lawrence, Giles, & Bollacker, 1999). ... This leads to a vocabulary size of V = 13,649 unique words in the NIPS data set and V = 30,799 unique words in the CiteSeer data set. Our collection of NIPS papers contains D = 1,740 papers with K = 2,037 authors and a total of 2,301,375 word tokens. Our collection of CiteSeer abstracts contains D = 162,489 abstracts with K = 85,465 authors and a total of 11,685,514 word tokens.

Perplexity is a standard measure for estimating the performance of a probabilistic model. The perplexity of a set of test words, (w_d, a_d) for d ∈ D^test, is defined as the exponential of the negative normalized predictive likelihood under the model,
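which for a single test document d with N_d test words can be written as

\mathrm{Perplexity}(\mathbf{w}_d \mid \mathbf{a}_d) = \exp\!\left(-\frac{\ln p(\mathbf{w}_d \mid \mathbf{a}_d)}{N_d}\right)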



Better generalization performance is indicated by a lower perplexity over a held-out document.

The author model is clearly poorer than either of the topic-based models, as illustrated by its high perplexity. Since a distribution over words has to be estimated for each author, fitting this model involves finding the values of a large number of parameters, limiting its generalization performance.

The author-topic model has lower perplexity early on (for small values of N_d^(train)) since it uses knowledge of the author to provide a better prior for the content of the document. However, as N_d^(train) increases we see a cross-over point where the more flexible topic model adapts better to the content of this particular document.

For larger numbers of topics, this crossover occurs for smaller values of N_d^(train), since the topics pick out more specific areas of the subject domain.

One can see that making use of the authorship information significantly improves the predictive log-likelihood: the model has accurate expectations about the content of documents by particular authors.

Such a task requires computing the similarity between authors. To illustrate how the model could be used in this respect, we defined the distance between authors i and j as the symmetric KL divergence between the topic distributions conditioned on each of the authors:
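A form consistent with this description, writing θ_i and θ_j for the two authors' topic distributions, is

\mathrm{sKL}(i, j) = \sum_{t=1}^{T}\left[\theta_{it}\log\frac{\theta_{it}}{\theta_{jt}} + \theta_{jt}\log\frac{\theta_{jt}}{\theta_{it}}\right]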

The topic distributions for different authors can also be used to assess the extent to which authors tend to address a single topic in their work, or cover multiple topics. We calculated the entropy of each author's distribution over topics on the NIPS data, for different numbers of topics.
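For an author a with topic weights θ_a, this is the standard entropy (up to the base of the logarithm):

H(a) = -\sum_{j=1}^{T} \theta_{aj} \log \theta_{aj}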

When compared to the LDA topic model, the author-topic model was shown to have more focused priors when relatively little is known about a new document, but the LDA model can better adapt its distribution over topics to the content of individual documents as more words are observed.
