Tuesday, December 3, 2013

Rzeszutek, R., Androutsos, D., & Kyan, M. (2010). Self-organizing maps for topic trend discovery. IEEE Signal Processing Letters, 17(6), 607-610.

For a corpus consisting of many documents and many words, a word-document co-occurrence matrix, C, can be defined that records the number of times each word appears in each document. LDA decomposes the matrix C into two matrices, Φ and Θ. Φ expresses the topic-word likelihoods: each element of the k-th vector of Φ, ϕk, gives the probability of a word appearing under the k-th topic. Θ holds the document-topic likelihoods: the elements of its d-th vector, θd, give how likely document d is to contain each topic. Put another way, Θ defines a document space in which each dimension describes the importance of the corresponding topic to a document; documents that are similar in meaning will also lie close to one another in this space. If the documents under study are divided into time windows according to their publication dates, the topic distribution of each window tw, θ(tw), can be represented by the average of the topic distributions of all documents within that window, as in the following equation
θ(tw) = (1 / ND(tw)) · Σ_{d ∈ tw} θd

Here, ND(tw) is the number of documents published within window tw. Earlier work [5] applied LDA to topic modeling of scientific papers, and the resulting plots showed a gradually increasing trend in global warming research over a ten-year period. However, the number of topics can reach 100 or more; if the temporal changes of all topics are drawn in one plot, it becomes very difficult to observe the pattern of change of each individual topic.
This study proposes combining the visualization capability of the self-organizing map (SOM) with the LDA model to track how topics change over time: by observing how the response patterns on the map change, the evolution of the content in the dataset can be examined in detail. Kohonen's self-organizing map [8] can project high-dimensional data onto a lower-dimensional lattice, so this study applies the technique to the problem of visually analyzing a large number of topics. The method consists of two stages, training and classification. During training, documents are repeatedly drawn at random; each node is compared with the drawn document, the most similar node is selected, and that node together with its neighbors is adjusted toward the document's topic distribution. After training, the map represents the entire document space, and each node represents a group of documents with similar topics. Unlike the traditional practice of training the map on all of the data, this study trains it on only 500 documents drawn at random from the corpus to reduce the computational load. A further, more significant difference is that the Kullback-Leibler (KL) divergence replaces the usual Euclidean distance when comparing the similarity of a document's topic distribution with a node's. When two documents have vectors A and B in the document space, their KL divergence is defined as

KL(A, B) = Σ_i A_i · log(A_i / B_i)
However, the KL divergence satisfies neither the triangle inequality nor symmetry; that is, KL(A, B) ≠ KL(B, A). To make the similarity measurement between documents symmetric, this study uses the following expression
KLsym(A, B) = (1/2) [KL(A, B) + KL(B, A)]

In the classification stage, the documents in the dataset are mapped, according to their publication dates, onto the SOM for the corresponding time window, so each window yields its own classification result on the map. For every node nj on the map, a classification density D(nj) is defined as
D(nj) = (Nj - Nmin) / (Nmax - Nmin)

Here Nj is the number of documents assigned to node nj, and Nmax and Nmin are the largest and smallest document counts over all nodes of that window's map. Two-dimensional maps are then produced from these values. The map for an individual window reveals the important topics within that window, while viewing the maps of all windows in sequence shows how the various topics change over time.

This study analyzes 25,754 sports news and commentary items collected from the RSS feeds of ESPN.com and TSN.ca between May and August 2009.

In order to track changes in topics over time, simple time-series techniques have been applied to a corpus analyzed using LDA [5]. For instance, in [5], the authors show how scientific papers on global warming gradually increase in popularity over a ten-year period. Unfortunately, it is not uncommon to have 100+ topics (i.e., dimensions) in a descriptor, which makes this analysis difficult for any more than two topics.

Therefore, we propose to use a method similar to the WEBSOM [6] and ProbMap [7] algorithms to perform the trend analysis. These algorithms use Kohonen’s Self Organizing Maps [8] to nonlinearly project a high dimensional feature space onto a low-dimensional output space. Our method merges the idea behind WEBSOM and ProbMap with the work done in [5] to show how a document corpus can change with time.

Given a corpus, it is possible to define a word-document co-occurrence matrix, C, that records the number of times a given word occurs in a given document. It is desirable to find a way to decompose C because a corpus may contain millions of documents and thousands of words.

Latent Dirichlet Allocation probabilistically decomposes C such that
C ≈ Φ Θ

where Φ and Θ are matrices that express how words, topics, and documents are related. Φ contains the topic-word likelihoods or, put another way, how likely any given word is to appear in topic k. Θ is a collection of document-topic likelihoods, which relates how likely a document is to contain topic k.

Therefore, for each topic k, there is a vector, ϕk, that contains the word distribution for that topic. Similarly, there exists a vector θd for each document d that contains the topic distribution for that document.

The topic-distribution matrix, Θ, acts as a natural descriptor for all of the documents in the corpus. For any document, d, its associated vector θd describes its location in a document space.

LDA ensures that documents that are semantically similar (i.e., share many words that are in the same topics) will be close to one another in the document space.
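As an illustration of how this decomposition can be obtained in practice, the short sketch below uses scikit-learn's LatentDirichletAllocation rather than the implementation used in the paper; raw_documents is a hypothetical stand-in for the article summaries, and the choice of 40 topics follows the experimental setup described later.

# A minimal sketch (not the authors' code): build the word-document counts C
# and factor them into document-topic (Theta) and topic-word (Phi) matrices.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

raw_documents = [
    "The Leafs beat the Canadiens in overtime",
    "Yankees sign a new pitcher before the trade deadline",
    "Wimbledon final goes to five sets",
]  # hypothetical stand-in for the RSS article summaries

vectorizer = CountVectorizer(stop_words="english")
C = vectorizer.fit_transform(raw_documents)        # documents x words counts

lda = LatentDirichletAllocation(n_components=40, random_state=0)
Theta = lda.fit_transform(C)                       # documents x topics; each row sums to 1
Phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # topics x words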

A common choice of divergence in this sort of situation is the Kullback-Leibler (KL) divergence, which is defined for two discrete PMFs, A and B, as
KL(A, B) = Σ_i A_i · log(A_i / B_i)

The KL divergence is not a distance measure (i.e., a metric): it does not satisfy the triangle inequality, and KL(A, B) ≠ KL(B, A). For convenience, it is useful to use a symmetric KL divergence so that the order of the arguments is not important. The symmetric KL measure is defined as
KLsym(A, B) = (1/2) [KL(A, B) + KL(B, A)]

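For reference, both quantities can be computed directly; the sketch below is my own illustration rather than the authors' code, and the averaged two-way form of the symmetric measure is one common convention that may differ in detail from the paper's equation.

import numpy as np

def kl_divergence(a, b, eps=1e-12):
    # KL(A, B) = sum_i A_i * log(A_i / B_i); eps guards against zero entries
    a = np.asarray(a, dtype=float) + eps
    b = np.asarray(b, dtype=float) + eps
    return float(np.sum(a * np.log(a / b)))

def symmetric_kl(a, b):
    # symmetrized KL: average of the two directions, so the argument order does not matter
    return 0.5 * (kl_divergence(a, b) + kl_divergence(b, a))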
For this paper, a corpus was constructed of 25 754 documents that were collected over a three-month period starting in late May 2009 and ending in late August 2009. The documents were article summaries from RSS feeds from sports news and opinion websites such as ESPN.com and TSN.ca.

The values of the hyperparameters were taken from [5], and NT = 40 was used since we found that this number of topics provided the best tradeoff between descriptor length and the ability to effectively describe the corpus.

The moving average approach simply produces a document descriptor that is the average descriptor for any particular time window. For each window, tw, we obtain a descriptor, θ(tw), such that
θ(tw) = (1 / ND(tw)) · Σ_{d ∈ tw} θd

where ND(tw) is the number of documents inside of the time window at time tw. θ(tw) then represents the central tendency of the documents inside of that time window.
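The window averages can be computed directly from Θ. The sketch below is illustrative only: it assumes Theta is the documents x topics matrix from the LDA step, timestamps is a hypothetical array of publication times, and it uses simple non-overlapping weekly windows rather than the paper's exact windowing.

import numpy as np
import pandas as pd

def window_descriptors(Theta, timestamps, freq="7D"):
    # theta(tw) = (1 / ND(tw)) * sum of theta_d over documents d falling in window tw
    df = pd.DataFrame(np.asarray(Theta), index=pd.to_datetime(timestamps))
    return df.resample(freq).mean().dropna(how="all")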

Unfortunately, it is very difficult to extend this sort of analysis to more than just two or three topics. Consider Fig. 3. All of the topics are shown on the same plot and it is extremely difficult to visually see any underlying patterns. More importantly, this type of trending only shows the most dominant document types at any point in time.

The Self-Organizing Map (SOM), as proposed by Kohonen [8], maps a high-dimensional feature space onto a lower dimensional representation (usually one or two dimensions).

This allows the map to perform a nonlinear dimensionality reduction on a dataset. However, it can also be used as a classifier, which is what we do in this paper. By examining how many data points each node classifies, it is possible to map complex structures in the feature space onto the lattice.

The first stage trains the SOM on a small, random subset of the original input data. For the dataset used in this paper, that subset is 500 randomly selected documents. This is done since we do not need the SOM to accurately model the entire dataset, just loosely resemble it. This has the added benefit of reducing the computational burden when training the SOM, especially for very large datasets. After training, the SOM will now resemble Θ such that each node is actually the representation of a cluster of similar documents (Fig. 4).
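A minimal version of this training stage might look like the following; this is my own sketch under the stated assumptions (a 10x10 lattice, 500 sampled documents, simple linearly decaying learning rate and neighborhood radius), not the authors' implementation.

import numpy as np

def symmetric_kl(a, b, eps=1e-12):
    # same symmetrized KL helper as above
    a, b = np.asarray(a, float) + eps, np.asarray(b, float) + eps
    return 0.5 * (np.sum(a * np.log(a / b)) + np.sum(b * np.log(b / a)))

def train_som(Theta, grid=(10, 10), n_samples=500, epochs=10, lr0=0.5, sigma0=3.0, seed=0):
    # Node weights are topic distributions; each update is a convex combination,
    # so rows stay non-negative and are re-normalized to sum to one.
    rng = np.random.default_rng(seed)
    Theta = np.asarray(Theta, dtype=float)
    sample = Theta[rng.choice(len(Theta), size=min(n_samples, len(Theta)), replace=False)]
    rows, cols = grid
    nodes = sample[rng.integers(0, len(sample), rows * cols)].copy()  # init from random docs
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    total_steps = epochs * len(sample)
    step = 0
    for _ in range(epochs):
        for d in sample[rng.permutation(len(sample))]:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)              # decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 0.5  # decaying neighborhood radius
            bmu = min(range(len(nodes)), key=lambda j: symmetric_kl(d, nodes[j]))
            dist2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))[:, None]
            nodes += lr * h * (d - nodes)        # pull the BMU neighborhood toward the document
            nodes /= nodes.sum(axis=1, keepdims=True)
            step += 1
    return nodes, coords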

As discussed in Section II-B, the KL divergence is used for the “distance” measure since it is well suited to describing the dissimilarity between the document probability vectors.

The second stage filters, or processes, the dataset through the SOM using the sliding window method described in Section III-B. We use the trained SOM as a classifier to determine how many documents in the window are classified by each node. Each node has an associated classification count, Nj, or the number of documents that are associated with node j. A document is associated with a node if nj is the closest node to that document vector.

We define a classification density, D(nj), such that
D(nj) = (Nj - Nmin) / (Nmax - Nmin)

where Nmax and Nmin are the maximum and minimum count values in the window. This ensures that the map is normalized to the range [0, 1] so that different windows can be compared.
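For a single window, the counts and the normalized density can be obtained as follows; this is an illustrative sketch that reuses the trained nodes and the symmetric KL helper from above, and it returns an all-zero map when every node receives the same count.

import numpy as np

def classification_density(window_Theta, nodes):
    # Nj: number of documents in the window whose best-matching node is nj
    counts = np.zeros(len(nodes))
    for d in np.asarray(window_Theta, dtype=float):
        bmu = min(range(len(nodes)), key=lambda j: symmetric_kl(d, nodes[j]))
        counts[bmu] += 1
    n_min, n_max = counts.min(), counts.max()
    if n_max == n_min:
        return np.zeros_like(counts)
    # D(nj) = (Nj - Nmin) / (Nmax - Nmin), so every window is scaled to [0, 1]
    return (counts - n_min) / (n_max - n_min)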

As the map responses change over time, they define a 3-D volume (Fig. 7). This volume describes how the map responds over time, as opposed to just observing the response of the SOM at any particular time. As before, the clustering properties of the SOM make it possible to determine how the document distribution itself changes.
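Stacking one density map per window produces such a volume; a short sketch, assuming the helpers above and a 10x10 lattice so each map can be reshaped onto the grid.

import numpy as np

def density_volume(window_Thetas, nodes, grid=(10, 10)):
    # window_Thetas: a list of per-window document-topic matrices, in chronological order
    maps = [classification_density(w, nodes).reshape(grid) for w in window_Thetas]
    return np.stack(maps)  # shape: (windows, rows, cols); axis 0 indexes time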
