本研究提出一個TTR-LDA-社群模型,這個模型以推論機制(inference mechanism)結合LDA(Latent Dirichlet Allocation)模型和Girvan-Newman社群偵測(community detection)演算法提供在網路資料上偵測社群並對這些社群進行主題探勘(topic mining)的功能,並且進而了解在社群上的主題隨時間推移的變化,處理的架構如下圖所示
本研究利用Delicious社會標籤系統(social tagging system)上從2005到2008年的資料進行研究。在社群偵測部分,首先建立網絡:根據使用者標籤的資源數量,選取前50000位標籤資源最多的使用者;然後對他們標籤的網頁進行統計,從其中選取10000個被最多使用者標籤的網頁。接著在上述的10000個網頁中,如果有這50000位使用者之間有任何兩位曾經標籤過相同的網頁,便在這兩位使用者之間產生一個連結。以50000個使用者為節點,同時以他們之間的連結為連結線,便可以建立一個共同書籤網絡(co-bookmark network)。並且為了研究網絡上社群結構的變化,並將整個期間的資料分為三個時段:分別為2005-2006、2007與2008年,相關的統計數據如下表:
本研究利用Girvan-Newman演算法找出標籤者(Tagger)社群,使得標籤者與同一社群內的其他標籤者比社群外的標籤者有較強的關係。這個演算法重複移去當時網絡上中介性(betweenness)最大的連結線,產生各種可能的網路劃分(network partition),測量每一種劃分下的群組性(modularity),也就是實際上社群內的成員彼此間的連結線數量與相同連結度的情況但隨機產生連結線的數量的差,群組性最大的劃分便是輸出結果。
另一方面,本研究利用TTR-LDA模型找出每個標籤者的主題分布以及主題內具有代表性的標籤,TTR-LDA模型修改自ACT(author-conference-topic)模式[11][12],是一個由標籤者做為第一層、標籤與資源為第三層、主題則為第二層,所構成的三層貝氏模型(three-layer Bayesian model)。
整合Girvan-Newman演算法和TTR-LDA模型的方法是以社群為單位,將社群內所有標籤者的主題分布進行平均做為該社群的主題分布,根據主題分布,選出機率值較大的主題做為社群的代表。比較不同時段社群共同的代表標籤衡量它們的相似性。
社群偵測的結果發現前五個最大的社群在四年裡占了絕大多數的比率,而且這個比率逐年增加。主題探勘的部分則測量TTR-LDA模型在不同主題數量上的複雜度(perplexity),發現150個主題時有最低的複雜度。因此,以下的研究便針對150個主題在前五個最大社群上的分布進行探討,計算它們的傳導性(conductance)與模組性。就社群模組性而言,最近一個時段(2008年)的結果比前三個時段還要高,其原因可能是因為經過一段時間後,社群的結構逐漸成熟,因此後期比前期更能產生較佳的社群。另外,將最後一個時段再細分為四個較小的時段則發現,較小時段的社群模組性比整年的結果來得高,本研究認為造成這種現象的原因可能是由於在不同的時段,大部分標籤者的書籤行為集中在不同的領域;當那些時段合併起來的時候,會展現標籤者在多個領域的興趣,使得社群內的叢集(clustering)特性較弱。
結果並可以發現前二十個主題都出現在不同時段的前五個最大社群裡,每個社群至少包括一個前十名的主題。此外,比較LDA、TTR-LDA和TTR-LDA-社群等三種模型在資源和標籤上的預測力,在回收率(recall rate)、精確率(precision)和F1指標上以TTR-LDA-社群為最佳。
In this paper, we propose a TTR-LDA-Community model which combines the Latent Dirichlet Allocation model (LDA) and the Girvan-Newman community detection algorithm with an inference mechanism.
The model is then applied to data from Delicious, a popular social tagging system, over the time period of 2005-2008.
Our results show that 1) users in the same community tend to be interested in similar set of topics in all time periods; and 2) topics may divide into several sub-topics and scatter into different communities over time.
From a research perspective, these real-world networks display unique properties from the classical random graph model [3] in that most real word networks exhibit three common properties: the small-world property, power-law degree distribution and a high clustering coefficient or transitivity (indicating community structure) [7][8][9].
Thus, an important task in network analysis is to detect communities and explore their features, which can improve community-supporting services at the community-level in the context of a social tagging system.
Many studies in various disciplines have been devoted to community detection; however, few of them have systematically and quantitatively studied the profiles of those detected communities.
In this paper, we propose a TTR-LDA-Community model, which is an inferential combination of an extended LDA model and a betweenness-based community detection algorithm. It provides rich, systematic, and quantitative information about the profiles of detected communities.
In the context of social tagging systems, where multiple users are annotating resources, the resulting topics reflect a shared view of the document; and the tags of the topics reflect a common vocabulary.
Girvan and Newman extended the betweenness measure to edges and designed a clustering algorithm which gradually removes the edges with the highest betweenness value [4]. This algorithm has been improved through modularity; and the complexity is reduced from O(m2n) to O(mdlogn) where d is the depth of the dendrogram of the community structure [2].
Many studies provide various models and algorithms for topic mining and community detection; yet, few of them have integrated those models and algorithms, performed topic mining for detected communities, and analyzed how those identified topics change among communities over time.
The activity of social tagging consists of three major components: tag, tagger and resource. The experimental dataset contains all the triples of these three components and the time and date of their creation on Delicious from 2005 to 2008.
In data processing, all taggers were ranked by the number of resources they have bookmarked and the top 50,000 taggers were selected as the sample of taggers.
These taggers bookmarked a total of 354,522 web pages, which were sorted by the number of taggers who bookmarked them. The top 10,000 resources were selected as the sample of web pages, associated with which a dominant majority of tagging activities occurred.
Thus a co-bookmark network was built in which a connection between two users (within the sample of 50,000 taggers) is created if they bookmarked the same resources (within the sample of 10,000 web pages).
In addition, in order to observe the evolution of structure and motif of communities, the time span (2005-2008) was divided into three slices.
The model is illustrated in Figure 1. TTR-LDA is developed based on ACT model [11][12]. It is a three-layer Bayesian model with taggers tap in each post p as the first layer, tags t, and resource r as third layer and all the topics denoted as latent variable z as the middle layer.
The inference mechanism is used to infer the topic distribution over detected communities.
Each community includes a set of taggers, who have a stronger relationship with other taggers within the community than the taggers outside.
Based on the taggers’ information model, the probability distribution of each tagger over a set of topics is obtained by using the TTR-LDA model while the community structure of taggers is revealed by the community detection algorithm. The two sets of results are further integrated through an inference mechanism.
Results show that the number of users of the top five communities occupies a major proportion in the four years (2005-2008) and the proportion is increasing over time.
Perplexity is used to identify the number of topics [10], which arrives at the lowest point when the number of topics is 150. The interest model of each tagger in the top five largest communities is then built based on their topic distributions.
By using users’ interest models and the inference mechanism, a topic distribution of the largest community can be created (Figure 3). We can find that the topic distributions in a community are diverse because users’ relationships in that community are mainly based on their co-bookmark activities not the similarity of their interest model.
In order to observe the dynamic features of communities, we design an experiment as follows:
1) denote the five largest communities from each time slice in 2008 as community_i_t where t means the tth time slice in 2008 and i means the ith largest community in tth time slice;
2) compute the topic distribution for the five communities, which is stored as model_t_i_Topic(j), the
probability of jth topic in ith largest community in the tth time slice;
3) obtain the probability distribution of tags that are collected from all the posts generated during the specific time slice; the probability of one tag occurring in a topic shows the level of representativeness of the tag for that topic;
4) sort all the tags according to their probability value in each topic and select the 20 top ranked tags to represent the content of the topics; select the top 5 ranked topics to represent the theme of each community;
5) analyze the similarity between different communities from different time slice through computing how many tags are shared by the two different communities. More specifically, we compare current time slice with its previous time slice, for example, we compare community_i_t with community_j_t-1 (j=1, 2…5).
The size of communities along evolutionary lines fluctuates over time. For example, the size of the community about social networks in the 3rd time slice (community_1_3) is much larger (4,377) than that (521) in the 4th time slice (community_5_4).
Conductance (from multi-criterion scores) and modularity (from single criterion scores) are used to evaluate the quality of communities detected by the TTR-LDA-Community model [6].
The smaller the value of conductance is, the higher the granularity of a community is. Network community profile (NCP) is used to compute and display the value of conductance for communities [5].
Whiskers networks and rewired networks are adopted as two comparative aspects. Whiskers is defined as the maximal sub graphs that can be detached from the rest of the network by removing a single edge; and a rewired network is a random network that has the same nodes and the same degree distribution as the original network [5].
The conductance of communities of the rewired original network (blue line in the left figure), rewired random network (red dashed line in the left figure), the original whiskers network (blue line in the right figure), and the random whiskers network (red dashed line in the right figure) are calculated and shown in Figure 4.
In Figure 4, compared with the rewired network (left) and the rewired whiskers (right), 1) the original network displays a higher granularity of communities (a lower conductance value); 2) the value of conductance as the function of the size of communities in the original network and the original whiskers present a “V” shape, showing properties of a true large social networks [5]; 3) the original whiskers has the best community granularity (the lowest conductance) between size 10-100; and 4) the best community granularity of rewired original network is around 1000.
The modularity of communities in the four time slices of 2008 is better than that in 2005-2007. This is probably due to the fact that community structure grows mature gradually over time, creating better communities in later years than in earlier years.
Meanwhile, modularity of communities in the short-term (four sub periods in 2008) is larger than the long-term (2008). It can be explained that in different time periods, most taggers’ bookmarking activities are focused on different domains, so in a certain short-term time period, communities may be quite different from each other. However, when those time periods are merged together, the taggers show different interests in many domains; so the clustering feature within the communities becomes weaker.
Results show that the most popular topics are about bandslash fiction, fan fiction, and supernatural fiction (the top 3 popular topics). Communities with similar theme are ranked 3rd, 4th, and 5th in size; and the web resources with similar topics are ranked 500-600 of the top 1000 ranked resources in number of taggers associated with them.
The top 20 ranked topics in 1000 most popular resources can be found in 5 largest communities in different time periods. For each community, there exists at least one topic that is ranked top 10 in 1000 most popular resources (Table 3).
Topic distributions for each community are obtained respectively from LDA, TTR-LDA model, and TTR-LDACommunity model based on co-bookmark network in a given period (Oct. 2008–Dec. 2008). One resource and five tags are recommended for each post according to the results of three models separately.
The TTR-LDA and TTR-LDA-Community model show significant improvement for recommendation of tags and resources for post in terms of precision, recall and F1-Measure. TTR-LDA and TTR-LDA-Community have slightly improved performance for “tags for post”, while TTR-LDA-Community outperforms TTR-LDA on “resource for post”.
沒有留言:
張貼留言