Saturday, November 16, 2013

Mimno, D., & McCallum, A. (2007, June). Mining a digital library for influential authors. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (pp. 105-106). ACM.

This study uses the following formula to find influential authors in a given field:

$$\Pr(a \mid q) \propto \sum_{d} \frac{1}{|A_d|} I\{a \in A_d\} \Pr(d) \Pr(q \mid d) \qquad (1)$$

In Equation (1), q is the query describing the field, a is any author within the field, and Pr(a|q) is the probability of finding author a given the query. d is a document in the field, A_d is the set of authors of document d, and I{a ∈ A_d} indicates whether a is an author of document d, so that (1/|A_d|) I{a ∈ A_d} is the probability Pr(a|d) of any one author given document d. Pr(d) is the influence of the individual document; following the approach suggested by Chen et al. [2], this study measures it with the PageRank algorithm, interpreting it as the probability that a researcher following references will end up reading this document.
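The combination in Equation (1) is simple to compute once its pieces are available. Below is a minimal Python sketch, assuming we already have each document's PageRank score (used as Pr(d)), its query likelihood Pr(q|d), and its author list; the function name and the toy numbers are illustrative, not from the paper.

```python
from collections import defaultdict

def rank_authors(docs):
    """docs: iterable of (authors, pr_d, q_given_d) triples.

    Returns authors sorted by Pr(a|q), computed per Equation (1) as
    sum_d Pr(a|d) Pr(d) Pr(q|d), with Pr(a|d) = 1/|A_d| for each author of d.
    """
    scores = defaultdict(float)
    for authors, pr_d, q_given_d in docs:
        if not authors:
            continue
        weight = pr_d * q_given_d / len(authors)  # split evenly among authors
        for a in authors:
            scores[a] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example with made-up numbers:
docs = [
    (["Salton", "Buckley"], 0.02, 0.8),
    (["Robertson"],         0.05, 0.6),
]
print(rank_authors(docs))  # Robertson ranks first here
```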

This study compares three ways of estimating Pr(q|d):

The first uses a language model with Dirichlet smoothing, computed with the following formula:

$$\Pr(q \mid d) = \prod_{w \in q} \frac{N_{wd} + \mu \frac{N_w}{N}}{N_d + \mu}$$

Here w is a word of the query, N_wd is the number of times this word appears in document d, and N_w is the number of times it appears across all documents in the corpus. N_d is the total number of word occurrences in document d, and N is the total number of word occurrences in the corpus. μ is a tuning parameter, which this study sets to 100.
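A small sketch of this scoring function, working in log space to avoid underflow on longer queries. It assumes word counts are held in plain dicts; note that a query word that never appears anywhere in the corpus (N_w = 0 and N_wd = 0) would make the product zero, and handling that case is left out of this sketch.

```python
import math

def dirichlet_lm_log_score(query, doc_counts, doc_len,
                           corpus_counts, corpus_len, mu=100.0):
    """log Pr(q|d) with Dirichlet smoothing:
    Pr(q|d) = prod_{w in q} (N_wd + mu * N_w / N) / (N_d + mu).

    doc_counts maps word -> N_wd; corpus_counts maps word -> N_w;
    doc_len is N_d; corpus_len is N.
    """
    log_p = 0.0
    for w in query:
        n_wd = doc_counts.get(w, 0)
        n_w = corpus_counts.get(w, 0)
        # a word unseen in the whole corpus gives probability 0;
        # real code would need to special-case it
        log_p += math.log((n_wd + mu * n_w / corpus_len) / (doc_len + mu))
    return log_p
```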

The second way selects, from an LDA (latent Dirichlet allocation) topic model, the single topic t that best matches the query q, and substitutes Pr(t|d) for Pr(q|d).

The third way also uses the LDA topic model, but estimates Pr(q|d) with the following formula:

$$\Pr(q \mid d) = \sum_{t} \Pr(t \mid d) \prod_{w \in q} \Pr(w \mid t)$$

The topic-model approaches use the LDA algorithm because it can effectively handle the problems of synonymy and polysemy, and it can therefore be regarded as a form of query expansion. LDA treats each document as a mixture of topics and each topic as a probability distribution over the words of the vocabulary; its principle is to use words that frequently co-occur in documents across the corpus to estimate the distribution of words within each topic, Pr(w|t), and the probability of each topic given a document, Pr(t|d).
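The third model's weighted sum can be written compactly with the two matrices a trained LDA model provides. A sketch assuming NumPy arrays, where p_w_t holds Pr(w|t) (topics × vocabulary) and p_t_d holds Pr(t|d) (documents × topics); these array names are my own, not the paper's:

```python
import numpy as np

def topic_query_likelihood(query_ids, p_w_t, p_t_d):
    """Pr(q|d) = sum_t Pr(t|d) * prod_{w in q} Pr(w|t), for all documents.

    query_ids: vocabulary indices of the query words.
    p_w_t: shape (num_topics, vocab_size), each row sums to 1.
    p_t_d: shape (num_docs, num_topics), each row sums to 1.
    """
    p_q_t = np.prod(p_w_t[:, query_ids], axis=1)  # Pr(q|t), one per topic
    return p_t_d @ p_q_t                          # one score per document
```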

This study uses the information retrieval field as a test case, and finds that the third way produces better results for identifying influential authors.

We present a probabilistic model that ranks authors based on their influence in particular areas of scientific research. This model combines several sources of information: citation information between documents as represented by PageRank scores, authorship data gathered through automatic information extraction, and the words in paper abstracts.

Authors are coreferenced using Machine Learning methods. Previous studies of author influence in digital library collections have been hampered by ambiguities in authorship (for example, Newman [4] groups authors by first initial and last name). Finally, the link structure of the collection is identified by extracting and disambiguating the references from papers.

We measure the influence of individual documents using the PageRank algorithm. Chen et al. [2] demonstrate the use of PageRank on research literature, using references in place of hyperlinks. PageRank can be thought of as modeling a researcher who moves from paper to paper in the document collection. At each paper, the researcher either follows a randomly chosen reference from the current paper or, with some fixed probability, chooses a random paper from the collection. The PageRank of a given paper can be interpreted as the probability that the researcher will be reading that paper at any given moment. Since the PageRank is a probability distribution over all documents in the collection, we use it as the probability of a given document, Pr(d).
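As one concrete reading of this random-surfer description, here is a short power-iteration sketch over a citation graph. The teleport probability is left as a parameter because the excerpt does not state the constant the authors used, and the graph representation is my own choice.

```python
import numpy as np

def pagerank(citations, num_docs, teleport=0.15, iters=50):
    """PageRank over a citation graph by power iteration.

    citations: dict mapping each paper id in 0..num_docs-1 to the
    list of paper ids it cites (empty list if it cites nothing).
    """
    pr = np.full(num_docs, 1.0 / num_docs)
    for _ in range(iters):
        nxt = np.full(num_docs, teleport / num_docs)  # random-jump mass
        for d, refs in citations.items():
            if refs:  # follow a uniformly chosen reference
                share = (1.0 - teleport) * pr[d] / len(refs)
                for r in refs:
                    nxt[r] += share
            else:     # paper with no references: jump anywhere
                nxt += (1.0 - teleport) * pr[d] / num_docs
        pr = nxt
    return pr  # a distribution over documents, used as Pr(d)
```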

For the probability of authors given documents we use a uniform distribution, dividing the weight of a document evenly between its authors. The probability of an author given a document is Pr(a|d) = 1/|A_d|, where A_d is the set of authors of paper d.

Using these elements we can construct a distribution over authors for a particular query,

$$\Pr(a \mid q) \propto \sum_{d} \frac{1}{|A_d|} I\{a \in A_d\} \Pr(d) \Pr(q \mid d) \qquad (1)$$

where I{a ∈ A_d} indicates whether a is listed as an author for a given paper.

For the component of the model that depends on the words in documents, Pr(q|d), we compare three statistical models. The first is based on a language model with Dirichlet smoothing. The second and third are based on a statistical topic model, using a single topic and a weighted sum of topics, respectively.

For the language model with Dirichlet smoothing, the probability of a query given a document is

$$\Pr(q \mid d) = \prod_{w \in q} \frac{N_{wd} + \mu \frac{N_w}{N}}{N_d + \mu}$$

where μ = 100, N_d is the number of words in document d, N_wd is the number of times word w appears in document d, N_w is the number of times word w appears in the corpus, and N is the total number of tokens in the corpus.

For the topic model we use Latent Dirichlet Allocation (LDA) [1]. LDA models documents as mixtures of "topics", which are probability distributions over the vocabulary of the corpus. Topic models are useful in handling synonymy (multiple words with similar meanings) and polysemy (words with multiple meanings), because they assign words to topics based on the context of the document. ... A trained topic model produces an estimate of the probability of a word given a topic, Pr(w|t), and the probability of a topic given a document, Pr(t|d).

In this application, the topic model can be considered a sort of query expansion: documents that contain none of the query words may still contain words that commonly occur in the same contexts as the query words.

In the second model we select a single topic t that matches the query and substitute Pr(t|d) for Pr(q|d) in Equation 1.
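The excerpt does not spell out how the single topic is chosen, so as one plausible reading, the sketch below picks the topic t* maximizing the query's likelihood under that topic, then uses Pr(t*|d) in place of Pr(q|d); the array names match the earlier sketch and are assumptions.

```python
import numpy as np

def best_topic(query_ids, p_w_t):
    """Pick t* = argmax_t prod_{w in q} Pr(w|t), computed in log space."""
    log_p_q_t = np.log(p_w_t[:, query_ids]).sum(axis=1)
    return int(np.argmax(log_p_q_t))

# Substituting into Equation (1): use p_t_d[:, t_star] as Pr(q|d).
# t_star = best_topic(query_ids, p_w_t)
```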

In the third model we represent Pr(q|d) as a weighted sum over all topics:

$$\Pr(q \mid d) = \sum_{t} \Pr(t \mid d) \prod_{w \in q} \Pr(w \mid t)$$

The weighted topic model approach to expert finding appears to be better able to generalize beyond the specific query words, while retaining a focus on areas relevant to the query. We believe that such models are a promising direction in expert finding, and a good example of the usefulness of structured digital library collections.
