Friday, June 13, 2014

Bache, K., Newman, D., & Smyth, P. (2013, August). Text-based measures of document diversity. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 23-31). ACM.


Scientometrics

This paper proposes a framework for measuring diversity from text content: a topic model is first trained on a document corpus, distances between topics are computed from how topics co-occur across documents, and the diversity of each document is then measured. The advantages of this framework are that it requires only the text of the corpus as input and is fully data-driven; it produces human-readable results; and it generalizes readily to entities such as authors, departments, and journals.
Quantifying the diversity of a population has attracted wide attention across disciplines, e.g., ecology [9], genetics [12], linguistics [8], and sociology [5]. When assessing a population's diversity, the individuals are typically assumed to fall into T categories, with proportions p_i collected into a vector π:
Stirling [22] proposed that diversity comprises three aspects: variety (the number of categories), balance among the proportions, and disparity (the differences between categories).
1. Variety: the number of categories present in the population. This is a relatively simple measure of diversity, namely the number of non-zero proportions in π.

2. Balance: whether the proportions are relatively even; common measures include Shannon entropy and variance-based quantities.

3. Disparity: how different the categories present in the population are from one another.
Stirling [22] further argued that each of these three aspects is a necessary but not sufficient condition for quantifying diversity, and that the formula proposed by Rao [18] integrates all three:

Δ(π) = Σ_i Σ_j p_i p_j δ(i, j) = π^t Δ π

where p_i and p_j are the proportions of categories i and j in the population, δ(i, j) is the distance between categories i and j, Δ is the T×T matrix of these distances, and t denotes matrix transposition.

Rafols and Porter [14] used the ISI journal subject categorizations of cited references to analyze how interdisciplinarity changed between 1975 and 2005 in six specific subject categories. In their study, p_i is the proportion of an article's references published in journals of subject category i, and δ(i, j) is defined as 1 minus the cosine similarity between the citation count vectors of subject categories i and j. They found that although the numbers of citations and co-authors increased markedly, the degree of interdisciplinarity grew only slowly. This approach is limited by its reliance on a predefined classification (Rafols and Porter [15]): subject categories may change over time and no longer reflect current disciplinary boundaries, and the method cannot be applied when no suitable classification exists for the data being analyzed. Moreover, whether citation data accurately reflects the content of scientific literature is itself debated.


This paper proposes a text-based framework that measures a document's diversity from its content. The method uses the Latent Dirichlet Allocation (LDA) topic model to infer T topics from a corpus of documents, producing a D×T matrix, where D is the number of documents and entry n_dj is the number of word tokens in document d assigned to topic j. From this matrix, a distance matrix over every pair of topics is computed using cosine distance; these distances are then combined with each document's topic distribution to estimate its diversity using the formula of Rao [18].
The method is fully data-driven, produces results that are easy to interpret, and generalizes to measuring the diversity of authors, academic departments, and journals.


In this paper we present a text-based framework for quantifying how diverse a document is in terms of its content.

The proposed approach learns a topic model over a corpus of documents, and computes a distance matrix between pairs of topics using measures such as topic co-occurrence. These pairwise distance measures are then combined with the distribution of topics within a document to estimate each document's diversity relative to the rest of the corpus.

The method provides several advantages over existing methods. It is fully data-driven, requiring only the text from a corpus of documents as input, it produces human-readable explanations, and it can be generalized to score diversity of other entities such as authors, academic departments, or journals.

The quantification of diversity has been widely studied in areas such as ecology [9], genetics [12], linguistics [8], and sociology [5].

The typical context is one in which we wish to measure the diversity of a population, where a population consists of a set of individual elements that have been categorized into T types (such as species), with proportions π = (p_1, ..., p_T), where Σ_i p_i = 1.

A relatively simple measure of diversity is variety: how many different species are present in a population, i.e., the number of non-zero proportions in π.

One can alternatively measure diversity as a function of the relative balance among the proportions (also referred to as `evenness' in ecology [13] or `concentration' in economics [4]), using measures such as Shannon entropy or variance-based quantities.
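The variety and balance measures above can be sketched in a few lines of Python (an illustration only; the function names and toy proportions are mine):

```python
import numpy as np

def variety(p):
    """Variety: the number of categories with non-zero proportion in pi."""
    return int(np.count_nonzero(np.asarray(p, dtype=float)))

def shannon_entropy(p):
    """Balance: Shannon entropy of the proportions; higher means more even."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]  # entropy terms with p_i = 0 contribute nothing
    return float(-np.sum(nz * np.log(nz)))

even = [0.25, 0.25, 0.25, 0.25]    # perfectly balanced population
skewed = [0.85, 0.05, 0.05, 0.05]  # same variety, much less balanced
```

Both populations have variety 4, but only the entropy distinguishes the even one from the skewed one, which is exactly why variety alone is not a sufficient characterization.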

From a more general perspective, Stirling [22] proposed that there are three distinct aspects to diversity: variety, balance, and disparity.

Disparity is the extent to which the categories that are present are different from each other, based for example on distance within a known taxonomy [21].

Stirling argued that each of these three properties is a necessary (but not sufficient) component of any quantitative characterization of diversity, arriving at a relatively simple mathematical formulation for diversity, a formulation originally proposed in earlier work by Rao [18]:

Δ(π) = Σ_i Σ_j p_i p_j δ(i, j) = π^t Δ π

where p_i and p_j are the proportions of categories i and j in the population, δ(i, j) is the distance between categories i and j, Δ is the T×T matrix of such distances, and π^t is the transpose of the T×1 vector of proportions π.
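Rao's measure is just the quadratic form π^t Δ π, so it can be computed directly; a minimal sketch (the 3-category distance matrix below is an invented toy):

```python
import numpy as np

def rao_diversity(p, delta):
    """Rao's diversity: sum_{i,j} p_i * p_j * delta(i, j) = p^t Delta p."""
    p = np.asarray(p, dtype=float)
    return float(p @ np.asarray(delta, dtype=float) @ p)

# Toy distance matrix: categories 0 and 1 are similar, 2 is distant from both.
delta = np.array([[0.0, 0.1, 0.9],
                  [0.1, 0.0, 0.9],
                  [0.9, 0.9, 0.0]])
near = [0.5, 0.5, 0.0]  # mass split over two similar categories
far = [0.5, 0.0, 0.5]   # mass split over two disparate categories
```

Both populations have identical variety and balance, yet `far` scores higher because its categories are more disparate, which is the behavior the three-aspect argument calls for.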

The contribution of the present paper is to investigate diversity in the context of text documents, using Rao's measure as a starting point.

In particular, we will use words as elements, topics as word categories, and documents as collections (or "populations") of words. Specifically, we address the following task: given a corpus of documents, assign a diversity score to each document, where this diversity score can be used to rank documents from most to least diverse.

Indeed, diversity as defined via co-citation counts is the most widely-used approach to quantify interdisciplinarity in practice, based on the notion that disciplines that are co-cited more often by the same article are "closer" than disciplines that are less frequently co-cited.

Journal subject categories are typically used to capture the notion of a discipline, most often the 244 manually defined ISI subject categories from Thomson Reuters, with each article assigned to the subject category of the journal in which it is published (e.g., [15, 14, 17, 23]).

Rafols and Porter [14] used journal subject categorizations of citations to analyze how interdisciplinarity changed between 1975 and 2005 for six specific subject categories. They concluded that although the number of citations and co-authors per paper was increasing significantly over time, the degree of interdisciplinarity was increasing at a much slower rate, as reflected by citation patterns between subject categories. As a component in their analysis, Rafols and Porter used Rao's diversity index based on a count matrix of D documents by T categories derived from citations: p_i was the proportion of citations made by an article to other articles that were published in journals belonging to subject category i, and δ(i, j) was defined as 1 minus the cosine similarity between the citation count vectors (across documents) of subject categories i and j.
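That category distance can be sketched as 1 minus the cosine similarity of two citation count vectors (the vectors below are invented for illustration):

```python
import numpy as np

def category_distance(ci, cj):
    """delta(i, j) = 1 - cosine similarity between the citation count
    vectors of subject categories i and j, indexed across documents."""
    ci, cj = np.asarray(ci, dtype=float), np.asarray(cj, dtype=float)
    return float(1.0 - ci @ cj / (np.linalg.norm(ci) * np.linalg.norm(cj)))

cat_a = [10, 0, 5]  # citation counts from documents 1..3
cat_b = [8, 1, 4]   # cited in a very similar pattern -> small distance
cat_c = [0, 9, 1]   # cited in a different pattern    -> large distance
```

Categories that are cited in similar patterns across the corpus end up "close", matching the co-citation intuition described above.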

Our work differs from this earlier work and related threads in scientometrics in two specific ways. First, in our approach the categories and distances, δ (i, j), are learned directly from the text content, rather than being based on manually predefined schema such as the ISI subject categories. ...  The second major difference in our approach is our use of word counts rather than citation counts (which are the basis of most prior work in scientometrics on quantifying interdisciplinarity). 

There are obvious limitations to relying on pre-defined taxonomies, as pointed out by Rafols and Porter [15]. Subject categories can change over time and no longer necessarily reflect current disciplinary boundaries.

In addition, in some contexts such as analysis of proposals and grants, there may be very limited or no categorizations available. For analysis of narrow domains (say the field of data mining and machine learning) existing categorization schemes may be too coarse-grained to be useful. In this context, a corpus-driven approach to learning the categories, such as the topic-based method we describe here, is a useful alternative, and in some cases may be the only option.

We expect that using text content will complement citation-based approaches, as both words and citations carry useful signal. There has long been debate over whether citations accurately reflect the content of a scientific article [2, 1]; arguably the words in an article may provide a more accurate reflection of the author's intentions than the citations the author uses.

Another field which is related to our current work is that of outlier detection. If we consider documents as being represented by T-dimensional vectors of counts, then one approach to quantifying diversity is to look for documents that are outliers in this T-dimensional space, using a multivariate outlier detection algorithm. ... Equivalently, since the pi are the components of a probability vector in a T - 1 dimensional simplex, we can think of high diversity documents as points that lie in the interior of the simplex (in at least 2 of the dimensions) rather than at the edge.

We use the Latent Dirichlet Allocation (LDA) topic model with collapsed Gibbs sampling to learn T topics for the D documents in the corpus [7]. A single iteration of the collapsed Gibbs sampler consists of iterating through the word tokens in the corpus, sequentially sampling a topic assignment for each word token in each document while keeping all other topic-word assignments fixed. Using the topic-word assignments from the final iteration of the Gibbs sampler, we create a D×T document-topic count matrix with entries n_dj corresponding to the number of word tokens in document d that are assigned to topic j.
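A sketch of producing the document-topic matrix. Note that scikit-learn's LDA uses variational inference rather than the collapsed Gibbs sampler described above, and it returns the normalized proportions P(j|d) rather than raw counts n_dj; the tiny corpus and T = 2 are invented for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene genome dna sequence protein expression",
    "neural network training gradient descent loss",
    "gene expression network model inference protein",
]
X = CountVectorizer().fit_transform(docs)  # D x V term-count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topic = lda.transform(X)  # D x T: row d is the topic mixture of doc d
```

Each row of `doc_topic` sums to 1 and plays the role of the per-document topic distribution used in the diversity computation below.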

In this context we can define Rao's diversity measure for each document d as

diversity(d) = Σ_i Σ_j P(i|d) P(j|d) δ(i, j)

where P(j|d) is the proportion of word tokens in document d that are assigned to topic j and δ(i, j) is a measure of the distance between topic i and topic j. Note that δ(i, j) is constant across all documents, and P(i|d) and P(j|d) vary from document to document.
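Given the D×T count matrix and a T×T topic-distance matrix, the per-document scores can be computed in one vectorized step (the toy counts and distances are mine):

```python
import numpy as np

def document_diversity(counts, delta):
    """Rao diversity per document: sum_{i,j} P(i|d) P(j|d) delta(i, j),
    from a D x T topic-count matrix and a T x T topic-distance matrix."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum(axis=1, keepdims=True)  # row d is P(.|d)
    return np.einsum('di,ij,dj->d', p, np.asarray(delta, dtype=float), p)

delta = np.array([[0.0, 0.2, 0.8],   # topics 0 and 1 co-occur often
                  [0.2, 0.0, 0.8],
                  [0.8, 0.8, 0.0]])  # topic 2 is far from both
counts = np.array([[60, 60, 0],      # spread over two nearby topics
                   [60, 0, 60],      # spread over two distant topics
                   [120, 0, 0]])     # single topic
scores = document_diversity(counts, delta)
```

Sorting documents by `scores` gives the most-to-least diverse ranking described earlier; a single-topic document scores exactly zero, and spreading mass over distant topics scores higher than spreading it over nearby ones.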



We presented an approach for quantifying the diversity of individual documents in a corpus based on their text content. Empirical results illustrated the effectiveness of the method on multiple large corpora.

This text-based approach for assigning diversity scores has several potential advantages over previous alternatives, such as methods that define diversity based on citations categorized into predefined journal subject categories. The text-based approach is more data-driven, performing the equivalent of learning journal categories by learning topics from text, and can be run on any collection of text documents, even without a prior categorization scheme.

In addition, it produces human-readable explanations and can be easily generalized to score the diversity of other entities such as authors, departments, or journals (e.g., by aggregating counts across such entities).
