This study uses the LDA (latent Dirichlet allocation) algorithm [17] as its method for community discovery, that is, for finding communities of actors in a social network that are densely connected within each group but relatively sparsely connected between groups. In this approach, communities are treated as latent variables of the LDA model, each being a mixture of all actors combined in different proportions. The study argues that one advantage of this approach is that community discovery requires only the topological information of the network; in addition, unlike earlier methods such as [2], [3], [5], it can describe an actor belonging to several communities at once, with a different degree of importance in each.
The study defines an actor's social interaction profile (SIP) as the set of actors who interact with that actor, together with the weights of those interactions; for example, the social interaction profile of actor vi is defined as
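The defining equation is not reproduced in the post; a plausible reconstruction, using the definitions of SIW(vi, wij) and mi given in the next sentence, is:

```latex
\mathrm{sip}_i = \bigl\{\, (w_{i1},\, SIW(v_i, w_{i1})),\ (w_{i2},\, SIW(v_i, w_{i2})),\ \dots,\ (w_{i m_i},\, SIW(v_i, w_{i m_i})) \,\bigr\}
```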
where SIW(vi, wij) is the interaction weight between actor vi and another actor wij, and mi is the number of actors who interact with vi.
When the LDA model is applied to community discovery, each actor's social interaction profile is treated as a document of the original model, and the interactions that actor takes part in are treated as the words of that document. It is worth noting that the interactions of an actor in a social network are exchangeable, i.e. their order does not matter, which makes the LDA model well suited to this setting. The figure below illustrates the model.
For an actor vi, the generative process of its social interaction profile sipi is as follows:
1) Sample mixture components ϕk ~ Dir(β) for k ∈ [1, K]
2) Choose θi ~ Dir(α)
3) Choose Ni ~ Poisson(ξ) (note that Poisson assumption is not critical to this model)
4) For each of the Ni social interactions wij :
(a) Choose a community ιij ~ Multinomial(θi);
(b) Choose a social interaction wij ~ Multinomial(ϕιij)
That is:
1) For each of the K communities, draw from Dir(β) a composition of actors ϕk that represents that community;
2) Draw a community mixture θi from Dir(α), representing the distribution over communities to which the actors interacting with vi may belong;
3) Draw the number of social interactions Ni from Poisson(ξ);
4) For each social interaction wij:
a) Draw a community ιij from Multinomial(θi);
b) Draw a social interaction wij from Multinomial(ϕιij).
A minimal sampling sketch of this process is given below.
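As an illustration only, the following Python sketch samples interaction profiles from this generative story; all names and parameter values (K, A, ξ, etc.) are assumptions, not the authors' code.

```python
# Minimal sketch of the SSN-LDA generative process (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

K = 5                      # number of communities
A = 100                    # size of the social actor space
alpha, beta = 1.0 / K, 0.01
xi = 8                     # Poisson rate for profile length

# 1) one distribution over actors per community
phi = rng.dirichlet(np.full(A, beta), size=K)      # shape (K, A)

def generate_profile():
    theta = rng.dirichlet(np.full(K, alpha))       # 2) community mixture for this actor
    N_i = rng.poisson(xi)                          # 3) number of social interactions
    profile = []
    for _ in range(N_i):                           # 4) one interaction at a time
        k = rng.choice(K, p=theta)                 # 4a) pick a community
        w = rng.choice(A, p=phi[k])                # 4b) pick an interacting actor
        profile.append(w)
    return profile

print(generate_profile())
```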
According to this model, the probability that the j-th social interaction element wij in actor vi's social interaction profile sipi is a particular actor wm is
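The equation itself is missing from the post; in the standard mixture form implied by the surrounding definitions it would read:

```latex
p(w_{ij} = w_m \mid \theta_i, \Phi)
  = \sum_{k=1}^{K} p(w_{ij} = w_m \mid \iota_{ij} = k, \phi_k)\, p(\iota_{ij} = k \mid \theta_i)
  = \sum_{k=1}^{K} \phi_{k,m}\,\theta_{i,k}
```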
where θi is the mixing proportion variable of sipi, and ϕk is the parameter set of the k-th community component distribution.
Given the hyperparameters α and β, the joint probability of all observed and hidden variables is
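The formula is likewise missing; the standard LDA-style factorization for a single profile, consistent with the generative process above, would be:

```latex
p(\mathrm{sip}_i, \iota_i, \theta_i, \Phi \mid \alpha, \beta)
  = p(\Phi \mid \beta)\; p(\theta_i \mid \alpha)
    \prod_{j=1}^{N_i} p(\iota_{ij} \mid \theta_i)\; p(w_{ij} \mid \phi_{\iota_{ij}})
```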
Because exact inference of the variables of the LDA model is generally intractable, approximate solutions are usually obtained with one of three methods: variational expectation maximization [17], expectation propagation [24], or Gibbs sampling [25], [26], [22]. This study adopts Gibbs sampling, a form of Markov-chain Monte Carlo simulation, to approximate the variables ϕk,w and θm,k.
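For concreteness, the following sketch shows what a collapsed Gibbs sampler for such a model typically looks like, using the standard LDA update applied to interaction profiles; it is an illustration under that assumption, not the authors' implementation.

```python
# Collapsed Gibbs sampling sketch for an LDA-style model over interaction profiles.
import numpy as np

def gibbs_lda(profiles, A, K, alpha, beta, iters=300, seed=0):
    """profiles: list of lists of actor ids; each inner list is one SIP."""
    rng = np.random.default_rng(seed)
    M = len(profiles)
    n_mk = np.zeros((M, K))            # interactions in profile m assigned to community k
    n_kw = np.zeros((K, A))            # actor w assigned to community k
    n_k = np.zeros(K)                  # total assignments per community
    z = []                             # community assignment of every interaction

    for m, prof in enumerate(profiles):              # random initialization
        z_m = rng.integers(K, size=len(prof))
        z.append(z_m)
        for w, k in zip(prof, z_m):
            n_mk[m, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for m, prof in enumerate(profiles):
            for j, w in enumerate(prof):
                k = z[m][j]                          # remove the current assignment
                n_mk[m, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # profile-length term is constant in k and therefore omitted
                p = (n_kw[:, w] + beta) / (n_k + A * beta) * (n_mk[m] + alpha)
                k = rng.choice(K, p=p / p.sum())     # resample a community
                z[m][j] = k
                n_mk[m, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    phi = (n_kw + beta) / (n_k[:, None] + A * beta)                    # phi_{k,w}
    theta = (n_mk + alpha) / (n_mk.sum(1, keepdims=True) + K * alpha)  # theta_{m,k}
    return phi, theta
```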
The study performs community discovery on the largest connected component of the co-authorship networks built from two bibliographic datasets, CiteSeer and NanoSCI; the CiteSeer component contains 249,866 nodes and the NanoSCI component 203,762 nodes. The study also proposes three ways of representing the social interactions between actors: 01-SIP, 012-SIP, and k-SIP. 01-SIP records only direct co-authorship: if two authors have co-authored at least one paper, their interaction is set to 1, and otherwise to 0. 012-SIP additionally considers two authors who have never co-authored but have each co-authored with a common third author: direct co-authors receive a weight of 2, authors linked only through a shared co-author receive a weight of 1, and all other pairs receive 0. k-SIP records repeated collaboration, with k being the number of papers the two authors have co-authored. The results show that 012-SIP outperforms the other two representations on both perplexity and the compactness of the resulting communities. From the discovered communities it can be seen that the authors in many communities belong to the same research institution or share similar research interests, and therefore work and publish together, forming a community.
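A minimal sketch of how the three SIP weightings could be derived from per-pair co-authorship counts is shown below; the data structure and helper name are hypothetical, not taken from the paper.

```python
# Hypothetical construction of 01-SIP, 012-SIP and k-SIP weights
# from per-pair co-authorship counts.
from collections import defaultdict

def build_sips(coauthor_counts):
    """coauthor_counts: dict mapping (author_a, author_b) -> number of joint papers."""
    neighbors = defaultdict(dict)
    for (a, b), k in coauthor_counts.items():
        neighbors[a][b] = k
        neighbors[b][a] = k

    sip01, sip012, sipk = {}, {}, {}
    for a, direct in neighbors.items():
        sip01[a] = {b: 1 for b in direct}            # 1 for any co-authorship
        sipk[a] = dict(direct)                       # k = number of joint papers
        profile = {b: 2 for b in direct}             # 2 for direct co-authors
        for b in direct:                             # 1 for co-authors' co-authors
            for c in neighbors[b]:
                if c != a and c not in profile:
                    profile[c] = 1
        sip012[a] = profile
    return sip01, sip012, sipk
```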
This paper describes an LDA (latent Dirichlet Allocation)-based hierarchical Bayesian algorithm, namely SSN-LDA (Simple Social Network LDA). In SSN-LDA, communities are modeled as latent variables in the graphical model and defined as distributions over the social actor space. The advantage of SSN-LDA is that it only requires topological information as input.
This model is evaluated on two research collaboration networks: CiteSeer and NanoSCI. The experimental results demonstrate that this approach is promising for discovering community structures in large-scale networks.
An important task in these emerging networks is community discovery, which is to identify subsets of networks such that connections within each subset are dense and connections among different subsets are relatively sparse.
Unlike those previous community discovery studies, we design a hierarchical Bayesian network based approach, namely SSN-LDA (Simple Social Network-LDA) to discover probabilistic communities from social networks. ... In this model, communities are modeled as latent variables and are considered as distributions on the entire social actor space.
We also propose three different approaches to create social interaction profiles based on the social interaction information in the network.
Girvan et al. extended this measure to edges and designed a clustering algorithm which gradually removes the edges with the highest betweenness values [5]. ... However, a major problem with this approach is its complexity of O(m²n), where m is the number of edges and n is the number of vertices in the network.
The graph partition problem can be formulated as the balanced minimum cut problem, where the goal is to find an optimal graph partition such that the edge weight between the partitions is minimized while partitions of a minimal size are maintained. Because the problem is NP-complete [15], approximate solutions are required. Flake et al. developed approximate algorithms that partition the network using s-t maximum-flow techniques [2], [3]. The main idea behind maximum flow is to create clusters that have small inter-cluster cuts and relatively large intra-cluster cuts.
The major difference between the SSN-LDA approach and the aforementioned approaches is that SSN-LDA is a mixture-model-based probabilistic approach. Each community weighs in the contributions from every social actor, and this property can be exploited in many potential applications that will be introduced in Section VI.
The LDA model was first introduced by Blei et al. for modeling the generative process of a document corpus [17].
Among these variants of LDA models, the approaches proposed in [20], [10] are both concerned with the authors of the documents in the corpus. In particular, Zhou et al. introduced a community latent variable into their graphical model and applied it to discover community information embedded in a document corpus. This approach can discover the underlying social network based on social interactions and topical similarity.
In their follow-up work [23], Zhou et al. investigated how research topics evolve over time and attempted to discover the most influential researchers involved in such transitions.
Each actor is characterized by its social interaction profile (SIP), which is defined as a set of neighbor (wij) and corresponding weight (SIW(vi, wij)) pairs.
where mi is the size of vi's social interaction profile.
Note that the social interaction elements in this profile are considered exchangeable, so their order is not taken into account. It is this exchangeability that permits the application of the LDA model [17].
Subsequently, we specify that a social network contains a set of communities ι (ι1, ι2, …, ιK), and each community in ι is defined as a distribution on the social actor space. In SSN-LDA, community assignments are modeled as a latent variable (ι) in the graphical model. The community proportion variable (θ) is regulated by a Dirichlet distribution with a known parameter α. Meanwhile, each social actor belongs to every community with a different probability, and therefore its social interaction profile can be represented as a random mixture over latent community variables.
The SSN-LDA model for social network analysis is illustrated in Fig. 1. Note that SSN-LDA resembles the topic-based LDA model [17], with the social network being analogous to the corpus, the social interaction profiles being analogous to documents, and the occurrences of social interactions being analogous to words.
The distribution of topics over documents and the distribution of terms over topics are two multinomial distributions with Dirichlet priors, whose hyperparameters are α and β respectively.
The dimensionality K of the Dirichlet distribution, which is also the number of community component distributions, is assumed to be known and fixed.
The generative process for an agent (wi)'s social interaction profile sipi in a social network is:
1) Sample mixture components ϕk ~ Dir(β) for k ∈ [1, K]
2) Choose θi ~ Dir(α)
3) Choose Ni ~ Poisson(ξ) (note that Poisson assumption is not critical to this model)
4) For each of the Ni social interactions wij :
(a) Choose a community ιij ~ Multinomial(θi);
(b) Choose a social interaction wij ~ Multinomial(ϕιij)
According to the model, the probability that the jth social interaction element wij in the social actor wi's social interaction profile sipi instantiates a particular neighboring agent wm is:
where θi is the mixing proportion variable for sipi and ϕk is the parameter set for the kth community component distribution.
Given the hyperparameters α and β, the joint distribution of all known and hidden variables is:
Exact inference is generally intractable for the LDA model. There have been three major approaches for solving this model approximately: variational expectation maximization [17], expectation propagation [24], and Gibbs sampling [25], [26], [22].
Gibbs sampling is a special case of Markov-chain Monte Carlo (MCMC) simulation [27] in which the dimensions of the distribution are sampled alternately, one at a time, conditioned on the values of all other dimensions [22].
We apply the Gibbs sampling algorithm introduced in [26], [22] to solve the SSN-LDA model and reduce the computational requirements.
Subsequently, the update equation for the hidden variable can be derived [22]:
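The derived equation is not reproduced in the post; in the standard collapsed-Gibbs form for LDA with symmetric Dirichlet priors (as stated below), and writing A for the size of the social actor space, it would read:

```latex
p(\iota_i = k \mid \iota_{\neg i}, \mathrm{sip})
  \;\propto\;
  \frac{n^{(w_i)}_{k,\neg i} + \beta}{n^{(\cdot)}_{k,\neg i} + A\beta}
  \cdot
  \frac{n^{(k)}_{m,\neg i} + \alpha}{n^{(\cdot)}_{m,\neg i} + K\alpha}
```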
where n(·)¬i is the count that does not include the current assignment of ιi; recall that sip is the variable for social interaction profiles. For the sake of simplicity, the Dirichlet distributions are assumed to be symmetric in deriving the above formula.
Finally, the update formulas for ϕk,w and θm,k are as follows:
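The formulas themselves are missing from the post; the standard estimates implied by the counts above would be:

```latex
\phi_{k,w} = \frac{n^{(w)}_{k} + \beta}{n^{(\cdot)}_{k} + A\beta},
\qquad
\theta_{m,k} = \frac{n^{(k)}_{m} + \alpha}{n^{(\cdot)}_{m} + K\alpha}
```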
In co-authorship networks, the vertices represent researchers and the edges represent collaboration relations between researchers. In this section we evaluate the SSN-LDA model on co-authorship networks collected from two distinct areas: computer science (CiteSeer) and nanotechnology (NanoSCI). Note that no name disambiguation has been done on either dataset.
The largest connected subnetwork of CiteSeer contains 249,866 nodes, while the largest connected subnetwork of NanoSCI contains 203,762 nodes. In this paper, we are only interested in discovering community structures in these two largest subnetworks.
In this paper, we explore three different types of social interaction profile representations for social networks, namely 01-SIP, 012-SIP, and k-SIP.
In the 01-SIP approach, an edge is drawn between a pair of scientists if they coauthored one or more articles. Collaborating multiple times does not make a difference in this model.
In order to mitigate this problem, we propose a 012-SIP model which takes a node's neighbors' neighbors into consideration. ... In this model, we distinguish a node's direct neighbors from its neighbors' neighbors by giving different weights to them.
This section describes the k-SIP model, where the weight of an edge is defined as the number of times the two authors have collaborated.
Then 10% of each dataset is held out as a test set, and the Gibbs sampling process is run on the training set for i iterations. In generating the exemplary communities, the number of communities is set to 50 and the number of iterations i to 1000. In the perplexity computation, i is set to 300 in order to shorten the computation time. In both cases, α is set to 1/K and β to 0.01, where K is the number of communities.
Table III shows 6 exemplary communities from a 50-community solution for the CiteSeer dataset, with social interaction profiles created using the 012-SIP representation. ... These exemplary communities give a flavor of the communities that can be discovered by this approach. Specifically, we observe that some communities are institution-based, while others are topic-based. ... This observation reflects the fact that researchers from the same institution or with similar research interests tend to collaborate more and build closer social ties.
Perplexity is a common criterion for measuring the performance of statistical models in information theory. It indicates the uncertainty in predicting the occurrence of a particular social interaction given the parameter settings, and hence it reflects the ability of a model to generalize to unseen data.
Perplexity PP is defined as
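The definition is not reproduced in the post; the usual held-out perplexity over test profiles, consistent with the surrounding text, would be:

```latex
PP(\mathrm{SIP}_{\mathrm{test}})
  = \exp\!\left( - \frac{\sum_{m=1}^{M} \log p(\mathrm{sip}_m)}{\sum_{m=1}^{M} N_m} \right)
```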
where wm denotes a social interaction profile in the test set, and
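The second formula is also missing; a standard form for the per-profile likelihood, using the learned ϕ and θ, would be:

```latex
\log p(\mathrm{sip}_m) = \sum_{v} n^{(v)}_{m} \log \sum_{k=1}^{K} \phi_{k,v}\, \theta_{m,k}
```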
where n(v)m is the number of times term v has been observed in document m.
It shows that the perplexity value is high initially and decreases as the number of communities increases. In addition, the results show that the 012-SIP approach yields lower perplexity values than the other two approaches.
The compactness of a community is measured by the average shortest-path distance among the top-ranked Nr researchers in the community. A short average distance indicates a compact community.
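As an illustration of this measure (a sketch only; the graph object, the ranking by ϕ, and the choice of Nr are assumptions):

```python
# Average pairwise shortest-path distance among a community's top-ranked members.
from itertools import combinations
import networkx as nx

def community_compactness(G, phi_k, actor_ids, Nr=10):
    """G: co-authorship graph; phi_k: community-k probabilities aligned with actor_ids."""
    ranked = sorted(zip(actor_ids, phi_k), key=lambda x: x[1], reverse=True)
    top = [a for a, _ in ranked[:Nr]]
    dists = [nx.shortest_path_length(G, a, b)
             for a, b in combinations(top, 2)
             if nx.has_path(G, a, b)]
    return sum(dists) / len(dists) if dists else float("inf")
```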
The t-test results show that the 012-SIP approach is significantly better than the other two approaches for both datasets.
The probabilities that can be derived from the SSN-LDA model can be helpful in determining the importance and roles of community members. For instance, the importance of a community member conditioned on the community variable ιj can be measured by the probability p(wi|ιj), which can easily be derived from this model based on the learned ϕ.
The SSN-LDA model provides an elegant way to measure the similarity of two communities: calculate the corresponding KL (Kullback-Leibler) distance and convert it into a similarity measure. KL divergence is a distance measure between two distributions, and the corresponding formula for the distance between two communities ιi and ιj is:
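The formula is not reproduced in the post; with each community k represented by its distribution ϕk over the A actors, the KL divergence would be:

```latex
D_{\mathrm{KL}}(\iota_i \,\|\, \iota_j)
  = \sum_{w=1}^{A} \phi_{i,w} \log \frac{\phi_{i,w}}{\phi_{j,w}}
```

Since KL divergence is asymmetric, a symmetrized variant such as (DKL(ιi‖ιj) + DKL(ιj‖ιi))/2 is commonly used when a symmetric similarity is needed.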