Pirolli, P. (2008). A Probabilistic Model of Semantics in Social Information Foraging. In AAAI Spring Symposium: Social Information Processing (pp. 72-77).
This study applies a topic model to the posts on the social media site Lostpedia, identifying their topics, visualizing the relationships among topics, and examining whether the evolution of social information foraging and sensemaking over time reflects real-world events. A total of 18,238 Lostpedia pages were analyzed, containing 34,352 distinct words; the occurrence counts of each word on each page were assembled into a matrix with 809,444 nonzero entries. A topic model was trained with Gibbs sampling, with the number of topics T set to 20 and 500 training iterations; the other two key parameters, α and β, were set to T/50 and 200/W, respectively. After training, each topic was represented by its ten most probable words, and a KL divergence measure was used to assess the similarity between topics; these measurements were fed into a multidimensional scaling (MDS) analysis to produce a two-dimensional plot for observing the relationships among topics. In addition, the study measured the most active words in each week and used them to trace the rise and fall of interest in the topics.
Over the past decade, the Web and the Internet have evolved to become much more social, including being much more about socially mediated information foraging and sensemaking. This is evident in the growth of blogs [10], wikis [20], and social tagging systems [16]. It is also evident in the growth of academic interest in socially mediated intelligence and the cooperative production of knowledge [2, 18, 19].
With that motivation, this paper presents extensions of Information Foraging Theory [14] aimed at modeling the acquisition of semantic knowledge about some topic area at the social level. The model is an attempt to start to address the question: how do the semantic representations of a community engaged in social information foraging and sensemaking evolve over time?
The Topic Model has been mainly directed at modeling the gist of words and the latent topics that occur in collections of documents. It seems especially well suited to modeling the communal production of content, such as scientific literatures [8], or wikis, as will be illustrated below.
The Topic Model assumes there is an inferred latent structure, L, that represents the gist of a set of words, g, as a probability distribution over T topics. Each topic is a distribution over words. A document is an example of a set of words. The generative process assumes that a document (a set of words) is generated by a statistical process that first selects the gist as a distribution over the topics contained in the document and then, for each word, selects a topic from this gist distribution and a word from that topic's word distribution. This can be specified as
P(wi) = Σj P(wi | zi = j) P(zi = j | g), summing over the T topics j,

where g is the gist distribution for a document and wi is the ith word in the document, selected conditionally on the topic zi, which is in turn selected conditionally on the gist distribution. In essence, P(z|g) reflects the prevalence of topics within a document and P(w|z) reflects the importance of words within a topic.
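To make the generative story concrete, here is a minimal sketch of that process with a toy gist and toy topic-word distributions; the vocabulary, the three topics, and all of the numbers are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and T = 3 illustrative topics; each row is P(w | z) and sums to 1.
vocab = ["island", "hatch", "numbers", "flight", "crash", "others"]
topic_word = np.array([
    [0.40, 0.30, 0.20, 0.05, 0.03, 0.02],   # topic 0
    [0.05, 0.05, 0.10, 0.40, 0.30, 0.10],   # topic 1
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.50],   # topic 2
])

# Gist of one document: P(z | g), a distribution over the 3 topics.
gist = np.array([0.6, 0.3, 0.1])

def generate_document(n_words):
    """Sample a document: pick a topic from the gist, then a word from that topic."""
    words = []
    for _ in range(n_words):
        z = rng.choice(len(gist), p=gist)        # z_i ~ P(z | g)
        w = rng.choice(vocab, p=topic_word[z])   # w_i ~ P(w | z_i)
        words.append(w)
    return words

print(generate_document(10))
```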
This generative model is an instance of a class of three-level hierarchical Bayesian models known as Latent Dirichlet Allocation [4] or LDA. The probabilities of an LDA model can be estimated with the Gibbs sampling technique in Griffiths and Steyvers [8].
Topics are represented as probability distributions over words, and each document's gist is represented as a probability distribution over topics.
In the current modeling effort, there were two simple aims: (a) understanding the semantic topics that underlie a community wiki, and (b) understanding the changes in production of topic knowledge over time in reaction to events in the world.
The latent topics in this communal system surely change over time as new evidence and mysteries are presented in the world (mainly via the Lost program, but also via other media and information releases by writers and producers).
A Topic Model was estimated from a Lostpedia database containing page editing information. The database covered the period from September 21, 2005 to December 13, 2006. That period covers Season 2 and part of Season 3 of Lost.
Over that period, 18,238 pages were created and 160,204 page revisions were performed. The mean number of revisions per page was 8.78 revisions/page and the median was 2 revisions/page.
The word activity table contains D = 18,238 pages and W = 34,352 distinct words.
The resulting word-document co-occurrence matrix of 18,238 pages and 34,352 words contained 809,444 nonzero entries.
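The sketch below shows how such a word-document count matrix might be assembled; the placeholder pages and the simple whitespace tokenizer stand in for the actual Lostpedia text processing, which the paper does not detail.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix

# Placeholder corpus standing in for the 18,238 Lostpedia pages.
pages = [
    "the hatch and the numbers",
    "oceanic flight 815 crash survivors",
    "the numbers appear on the hatch",
]

# Tokenize and build the vocabulary (W distinct words).
tokenized = [page.lower().split() for page in pages]
vocab = sorted({w for doc in tokenized for w in doc})
word_index = {w: i for i, w in enumerate(vocab)}

# Fill a sparse D x W matrix of word counts; only nonzero cells are stored.
rows, cols, vals = [], [], []
for d, doc in enumerate(tokenized):
    for word, count in Counter(doc).items():
        rows.append(d)
        cols.append(word_index[word])
        vals.append(count)

X = csr_matrix((vals, (rows, cols)), shape=(len(pages), len(vocab)), dtype=np.int64)
print(X.shape, X.nnz)   # (D, W) and the number of nonzero entries
```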
LDA was performed using Gibbs sampling (see the Appendix) to produce T = 20 topics. Exploration of other values of T (both larger and smaller) was generally consistent with the T = 20 model, but this number of topics is about the right size to discuss in a paper. The sampling algorithm was run for 500 iterations with parameters α = T/50 and β = 200/W (see Appendix).
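The paper does not list its sampler code. As one possible stand-in, the third-party Python package `lda` (which fits LDA by collapsed Gibbs sampling) could be driven with roughly these settings; mapping α and β onto the package's `alpha` and `eta` arguments is an assumption, and the random count matrix below merely takes the place of the real data.

```python
import numpy as np
import lda  # third-party package that fits LDA by collapsed Gibbs sampling

rng = np.random.default_rng(0)

# Placeholder count matrix standing in for the 18,238-page x 34,352-word Lostpedia matrix.
D, W = 200, 500
X = rng.poisson(0.2, size=(D, W))

T = 20                 # number of topics, as in the paper
alpha = T / 50         # document-topic prior reported in the text
beta = 200 / W         # topic-word prior reported in the text

model = lda.LDA(n_topics=T, n_iter=500, alpha=alpha, eta=beta, random_state=1)
model.fit(X)           # X: a D x W matrix of word counts

phi = model.topic_word_    # T x W: estimated P(w | z), one row per topic
theta = model.doc_topic_   # D x T: estimated gist (topic mixture) for each page
```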
In the results of this estimation, each topic is represented as an estimated probability distribution over words. One way to understand each topic is to examine the most probable words associated with each topic.
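A short sketch of that step, pulling the most probable words out of an estimated topic-word matrix, is given below; the two-topic matrix and five-word vocabulary are illustrative only.

```python
import numpy as np

def top_words(phi, vocab, n=10):
    """Return the n most probable words for each topic in a T x W matrix phi."""
    return [[vocab[i] for i in np.argsort(row)[::-1][:n]] for row in phi]

# Tiny illustration with 2 topics over a 5-word vocabulary.
vocab = ["hatch", "numbers", "island", "flight", "crash"]
phi = np.array([
    [0.40, 0.30, 0.20, 0.06, 0.04],
    [0.05, 0.05, 0.10, 0.45, 0.35],
])
for j, words in enumerate(top_words(phi, vocab, n=3)):
    print(f"Topic {j}: {', '.join(words)}")
```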
Since the topics are represented as probabilities, the inter-topic similarities (or dissimilarities) can be defined using a KL divergence measure (see Appendix) and submitted to a multidimensional scaling (MDS) analysis that can be used to plot the topics as points in a lower-dimensional space such that the spatial distance among points reflects (as best possible) the dissimilarities among topics.
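The sketch below illustrates this pipeline on a stand-in topic-word matrix. The symmetrized form of the KL divergence and the use of scikit-learn's metric MDS are assumptions, since the paper specifies only that a KL divergence measure and MDS were used.

```python
import numpy as np
from sklearn.manifold import MDS

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) between two discrete distributions."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

# phi: T x W matrix of topic-word probabilities (here a small random stand-in).
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.ones(50), size=20)    # 20 topics over a 50-word vocabulary

T = phi.shape[0]
dissim = np.zeros((T, T))
for i in range(T):
    for j in range(T):
        # Symmetrized KL divergence between topics i and j (an assumption; the
        # paper says only that a KL divergence measure was used).
        dissim[i, j] = 0.5 * (kl(phi[i], phi[j]) + kl(phi[j], phi[i]))

# Metric MDS on the precomputed dissimilarities gives 2-D coordinates for plotting.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=1)
coords = mds.fit_transform(dissim)           # shape (T, 2)
```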
One way to assess the value of the Topic Model is to see what kinds of trends it detects in attention to topics over time, and to see how the model assessments relate to what we know about releases of information in the world of Lost.
The strength of a topic was determined as the proportion of weekly word activity assigned to that topic compared to all topics.
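A sketch of that weekly proportion is given below, assuming an activity table with one row per word token, tagged with its week and its sampled topic assignment; the column names and the tiny example table are illustrative.

```python
import pandas as pd

# Illustrative activity table: one row per word token, with the week it was
# written and the topic the sampler assigned to it.
activity = pd.DataFrame({
    "week": ["2006-01-02", "2006-01-02", "2006-01-02", "2006-01-09", "2006-01-09"],
    "topic": [3, 3, 7, 3, 7],
})

# Count word tokens per (week, topic), then normalize within each week so that
# a topic's strength is its share of that week's total word activity.
counts = activity.groupby(["week", "topic"]).size().unstack(fill_value=0)
strength = counts.div(counts.sum(axis=1), axis=0)
print(strength)
```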
The Topic Model has been applied to the analysis of scientific literatures and used to make sense of trends in those literatures. It appears that the Topic Model can be similarly applied to a socially mediated sensemaking site to reveal the waxing and waning of interest in semantic topics in reaction to events in the world.
The Topic Model was applied to the Lostpedia wiki and found to reveal a sensible set of semantic topics and to track interest in those topics in reaction to events in Lost.
P(w|z) is represented as a set ϕ of T multinomial distributions over the W words, such that

P(wi | zi = j) = ϕwi(j),

where zi is the topic associated with the word wi in document di, and the jth topic is represented as a multinomial distribution with parameter vector ϕ(j), whose component for word wi is written ϕwi(j).
P(z) is represented as a set θ of D multinomial distributions over the T topics, such that for each document di there is a multinomial distribution with parameter θ(di), and for every word wi in document di the topic zi of that word is distributed as

P(zi = j) = θj(di),

where θj(di) is the jth component of θ(di).
The Dirichlet distribution is the Bayesian conjugate for the multinomial distribution and provides Dirichlet priors on the parameters ϕ(j) and θ(di).
The Topic Model provides estimates of T multinomial distribution parameters, one for each topic, such that the probability of any word w conditional on a topic j is given by P(w | z = j) = ϕw(j), the component of the estimated ϕ(j) corresponding to w.
The Topic Model also estimates a set of probability distributions over topics, one for each document d, so that the prevalence of topic j in document d is given by θj(d).
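As a concrete illustration of the estimation machinery summarized in this appendix, here is a compact collapsed Gibbs sampler written from the standard Griffiths-and-Steyvers formulation; it is a sketch under that assumption, not the paper's own implementation, and the toy corpus and variable names are illustrative.

```python
import numpy as np

def gibbs_lda(docs, W, T, alpha, beta, n_iter=500, seed=0):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
    rng = np.random.default_rng(seed)
    n_wt = np.zeros((W, T))            # counts of word w assigned to topic j
    n_dt = np.zeros((len(docs), T))    # counts of topic j in document d
    n_t = np.zeros(T)                  # total words assigned to each topic
    z = []                             # current topic assignment of every token

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        z_d = rng.integers(T, size=len(doc))
        z.append(z_d)
        for w, t in zip(doc, z_d):
            n_wt[w, t] += 1
            n_dt[d, t] += 1
            n_t[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove this token's current assignment from the counts.
                n_wt[w, t] -= 1
                n_dt[d, t] -= 1
                n_t[t] -= 1
                # P(zi = j | rest) is proportional to
                # (n_w_j + beta) * (n_j_d + alpha) / (n_j + W * beta).
                p = (n_wt[w] + beta) * (n_dt[d] + alpha) / (n_t + W * beta)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                n_wt[w, t] += 1
                n_dt[d, t] += 1
                n_t[t] += 1

    # Point estimates of the appendix's phi and theta from the final counts.
    phi = (n_wt + beta).T / (n_t[:, None] + W * beta)                        # T x W
    theta = (n_dt + alpha) / (n_dt.sum(axis=1, keepdims=True) + T * alpha)   # D x T
    return phi, theta

# Tiny worked example: 3 documents over a 4-word vocabulary, 2 topics.
docs = [[0, 0, 1], [2, 3, 3], [0, 1, 2]]
phi, theta = gibbs_lda(docs, W=4, T=2, alpha=0.4, beta=0.05, n_iter=200)
```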