Thursday, March 14, 2013

Chen, P., & Redner, S. (2010). Community structure of the physical review citation network. Journal of Informetrics, 4(3), 278-290.


network analysis

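This study builds a citation network from articles published in Physical Review and applies community detection to identify the subfields of physics research. It then examines quantitative properties of the major subfields, such as their size, time history, and citation impact.

Community detection rests on the idea that papers within the same subfield cite one another densely, whereas citations across subfields are sparser. This idea is captured by modularity (Newman, 2005), a measure that community detection algorithms commonly use: a partition of a network yields a larger modularity when nodes within communities are densely linked and links between communities are few. According to Danon, Duch, Diaz-Guilera, & Arenas (2005) and Guimera & Amaral (2005), a modularity of 0.3 or more can be taken to indicate community structure. A random network has a modularity of 0, and the better the partition, the larger the measured modularity (Newman & Girvan, 2004).

The study therefore uses community detection algorithms to find a partition of the citation network with large modularity, discovering the subfields of physics research purely from the structure of the citation network. The following algorithm uses the modularity measure to partition the citation network:
1. Calculate the modularity for a subgroup.
2. If the leading eigenvalue is zero or negative, the subgroup cannot be divided.
3. If the leading eigenvalue is positive, calculate the modularity of the whole network after this subgroup is divided.
4. If the global modularity increases, divide the subgroup; otherwise leave it undivided, mark it as indivisible, and look for another subgroup to process in the same way.
Repeat steps 1 to 4 until no divisible subgroup remains; at that point the global modularity has reached its maximum.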
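Among the Physical Review papers published from 1893 to 2007, 2920 have been cited more than 100 times, and there are 11,749 citations among these highly cited papers. The community detection algorithm described above yields 274 communities, with a maximum modularity of 0.543.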


For PR (Physical Review) publications, we ask: can articles be naturally grouped into distinct subfields, with a high density of citations among papers within a given subfield and sparser citations across subfields? By the very nature of physics research and also as revealed by the data, this partitioning into subfields is self-evident. While anecdotal information exists about the identity and evolution of some of the more prominent subfields of physics (Capri, 2007), here we determine their quantitative properties, such as their size, time history, and citation impact.
More recent work has led to the formulation of new and powerful methods to detect communities in complex networks, both with undirected (Herrera et al., 2009; Kim, Son, & Jeong 2009; Lancichinetti, Fortunato, & Kertesz, 2009; Leicht & Newman, 2008; Porter et al., 2010; Rosvall & Bergstrom, 2008) and directed (Leicht & Newman, 2008) links. A systematic review of these developments is given in (Porter et al., 2010).
One particularly useful approach exploits the concept of modularity (Newman, 2005). Compared to the earlier community detection methods, the use of this metric to identify communities requires no extra knowledge beyond the network structure itself, involves no subjective judgments, and can be applied to any type of network.
To detect communities within networks, we want to determine sets of vertices that are more strongly connected to each other but less connected to the rest of the network. For this purpose, we use the modularity Q (Newman & Girvan, 2004) ...
The modularity Q gives the difference between the number of links between groups in the actual network and the expected number of links between these same groups in an equivalent random network with the same link density. A modularity Q = 0 corresponds to a random network, in which two nodes are connected with probability that is proportional to their respective degrees. Empirical data indicate that a modularity value Q ≥ 0.3 is indicative of true community structure (Danon, Duch, Diaz-Guilera, & Arenas, 2005; Guimera & Amaral, 2005), and the largest modularity that has been observed in real-world examples is 0.7 (Newman, 2005).
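To make the definition concrete, here is a minimal sketch that computes Q for a small directed network. The excerpt elides the formula, so the form used here, Q = (1/m) Σ_ij [A_ij − k_i^out k_j^in / m] δ(c_i, c_j), is an assumption based on the directed modularity commonly attributed to Leicht & Newman (2008). The function name `directed_modularity` and the six-node toy network are hypothetical and not taken from the paper.

```python
import numpy as np

def directed_modularity(A, labels):
    """Directed modularity of a partition (assumed form, see lead-in).

    A[i][j] = 1 if paper i cites paper j; labels[i] is the community of node i.
    """
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    m = A.sum()                       # total number of links
    k_out = A.sum(axis=1)             # citations made by each node
    k_in = A.sum(axis=0)              # citations received by each node
    B = A - np.outer(k_out, k_in) / m # observed minus expected links
    same = labels[:, None] == labels[None, :]
    return float(B[same].sum() / m)

# Toy network: two directed 3-cycles (0->1->2->0, 3->4->5->3) joined by 2->3.
A = [[0, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [1, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0, 0]]

print(directed_modularity(A, [0, 0, 0, 0, 0, 0]))  # one community: Q = 0
print(directed_modularity(A, [0, 0, 0, 1, 1, 1]))  # two cycles: Q = 18/49, about 0.367
```

In this toy case, splitting the network into its two cycles gives a Q above the 0.3 threshold quoted above, while lumping everything into one community gives Q = 0.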
The steps to detect communities in a directed network at some intermediate stage of division therefore are:
1. Calculate the modularity for a subgroup.
2. If the leading eigenvector (eigenvalue?) is negative or zero, the subgroup is indivisible.
3. If the leading eigenvector (eigenvalue?) is positive, calculate the modularity of the entire network, assuming that the division is applied.
4. If the global modularity increases, perform the division. If not, abandon the division, mark this subgroup as indivisible, and process another divisible subgroup.
Repeat steps 1–4 until Q reaches its maximum (or equivalently, all subgroups are marked as indivisible).
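The steps above can be sketched for a small directed network as follows. This is a sketch under stated assumptions, not the authors' implementation: it assumes the leading-eigenvector bisection of Leicht & Newman (2008), applied to the symmetrized modularity matrix B + Bᵀ, and it reads "leading eigenvector" in steps 2 and 3 as the leading eigenvalue of that matrix restricted to the subgroup. The function names and the toy network are hypothetical.

```python
import numpy as np

def modularity_matrix(A):
    """Symmetrized directed modularity matrix B + B^T and link count m."""
    A = np.asarray(A, dtype=float)
    m = A.sum()
    B = A - np.outer(A.sum(axis=1), A.sum(axis=0)) / m
    return B + B.T, m

def try_split(Bsym, m, group):
    """Steps 1-4 for one subgroup: return two halves, or None if indivisible."""
    Bg = Bsym[np.ix_(group, group)].copy()
    Bg -= np.diag(Bg.sum(axis=1))       # generalized matrix for a subgroup
    vals, vecs = np.linalg.eigh(Bg)     # eigenvalues in ascending order
    if vals[-1] <= 1e-10:               # step 2: no positive leading eigenvalue
        return None
    s = np.where(vecs[:, -1] > 0, 1.0, -1.0)  # sign pattern gives the division
    delta_q = s @ Bg @ s / (4.0 * m)    # step 3: change in global modularity
    if delta_q <= 1e-10:                # step 4: abandon the division
        return None
    group = np.asarray(group)
    return group[s > 0].tolist(), group[s < 0].tolist()

def detect_communities(A):
    """Repeat the bisection until every subgroup is marked indivisible."""
    Bsym, m = modularity_matrix(A)
    pending = [list(range(len(A)))]
    indivisible = []
    while pending:
        group = pending.pop()
        parts = try_split(Bsym, m, group)
        if parts is None:
            indivisible.append(sorted(group))
        else:
            pending.extend(parts)
    return sorted(indivisible)

# Toy network: two directed 3-cycles (0->1->2->0, 3->4->5->3) joined by 2->3.
A = [[0, 1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [1, 0, 0, 1, 0, 0],
     [0, 0, 0, 0, 1, 0],
     [0, 0, 0, 0, 0, 1],
     [0, 0, 0, 1, 0, 0]]
print(detect_communities(A))  # [[0, 1, 2], [3, 4, 5]]
```

On the toy network the first bisection separates the two cycles, and each cycle is then found to be indivisible because its restricted matrix has no positive eigenvalue.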
To check the robustness of the results, we also apply a recently introduced bottom-up algorithm (Blondel et al., 2008). Here each network node is initially assigned to a distinct community. Then, for each node i, the change in modularity is calculated after provisionally assigning i to be in the community of one of its neighbors. Node i is then assigned to the community that maximizes the increase of the modularity. If there is no increase, then node i remains in its original community. Each network node is processed in this way, and this operation is iterated for each community in the network, until a maximal modularity is achieved. This algorithm is computationally more efficient than the previous top-down approach, but has the disadvantage of requiring considerably more computer memory.
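The node-moving phase described above can be illustrated with a deliberately naive sketch. It is not the Blondel et al. implementation: it recomputes Q from scratch for every trial move instead of updating it incrementally, it omits the later stage in which communities are merged into super-nodes, and it assumes the directed form of modularity, Q = (1/m) Σ_ij [A_ij − k_i^out k_j^in / m] δ(c_i, c_j). The function names and the toy network are hypothetical.

```python
import numpy as np

def modularity(A, labels):
    """Directed modularity of a partition (assumed form, see lead-in)."""
    m = A.sum()
    B = A - np.outer(A.sum(axis=1), A.sum(axis=0)) / m
    return float(B[labels[:, None] == labels[None, :]].sum() / m)

def local_moves(A):
    """Move nodes into neighboring communities while Q strictly increases."""
    A = np.asarray(A, dtype=float)
    n = len(A)
    labels = np.arange(n)                  # each node starts in its own community
    improved = True
    while improved:
        improved = False
        for i in range(n):
            # Neighbors are nodes that i cites or that cite i.
            neighbors = np.flatnonzero(A[i] + A[:, i])
            best_label, best_q = labels[i], modularity(A, labels)
            for c in set(labels[neighbors]) - {labels[i]}:
                trial = labels.copy()
                trial[i] = c               # provisional assignment
                q = modularity(A, trial)
                if q > best_q + 1e-12:     # keep only strict improvements
                    best_label, best_q = c, q
            if best_label != labels[i]:
                labels[i] = best_label
                improved = True
    return labels

# Toy network: two directed 3-cycles (0->1->2->0, 3->4->5->3) joined by 2->3.
A = np.array([[0, 1, 0, 0, 0, 0],
              [0, 0, 1, 0, 0, 0],
              [1, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0, 0]], dtype=float)
labels = local_moves(A)
print(labels, modularity(A, labels))
```

On this toy network the moves settle on the two cycles as communities. The loop terminates because every accepted move strictly increases Q, and the number of possible partitions is finite.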
Our PR citation network consists of 433,452 articles published from 1893 through August 2007 with at least one citation (the nodes) and 4,370,203 total citations among these publications (the links) at any time during this 114-year period. To keep the scope manageable we restrict ourselves to well-cited PR publications, defined as those with more than 100 citations. This restriction reduces the network to N = 2920 publications and L = 11,749 citations. All citations to publications outside this highly cited set and citations from “external” PR publications to this highly cited set are excluded.
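The restriction to the highly cited subnetwork can be sketched as a two-pass filter over a list of citation pairs. This is an illustrative sketch only: the input format (a list of `(citing, cited)` pairs), the function name `highly_cited_subnetwork`, and the toy data are assumptions, since the excerpt does not describe how the data were stored or processed.

```python
from collections import Counter

def highly_cited_subnetwork(citations, threshold=100):
    """Keep papers with more than `threshold` citations and the links among them.

    citations: iterable of (citing_id, cited_id) pairs (assumed format).
    Returns (set of kept paper ids, list of kept links).
    """
    citations = list(citations)
    # Pass 1: count how often each paper is cited in the full network.
    counts = Counter(cited for _, cited in citations)
    keep = {paper for paper, c in counts.items() if c > threshold}
    # Pass 2: drop every link with an endpoint outside the highly cited set.
    links = [(u, v) for u, v in citations if u in keep and v in keep]
    return keep, links

# Toy data with a threshold of 1 instead of 100.
toy = [("a", "b"), ("c", "b"), ("b", "d"), ("c", "d"), ("a", "c")]
nodes, links = highly_cited_subnetwork(toy, threshold=1)
print(sorted(nodes), links)  # ['b', 'd'] [('b', 'd')]
```

Papers "b" and "d" are each cited twice and survive the filter; only the link between them remains, which mirrors how citations to and from papers outside the highly cited set are excluded.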
At the end of the modularity maximization procedure, there are 274 distinct communities, and the network modularity is Q = 0.543 (Q = 0.514 if RMP papers are included). For these 274 communities, the largest has 191 members (publications) and the smallest has only a single member. The 10 largest communities (listed in Table 1) contain 1369 publications and comprise 46.9% of the highly cited subnetwork.
By the nature of the partitioning into communities, the links that remain between communities at the end of modularity maximization should be weak. In fact, only 17 out of 393 crosslinks have a weight that exceeds 0.01 (right side of Fig. 2). Moreover, 16 out of these remaining crosslinks join communities within the same major groupings that emerged in the initial few division steps.
The large difference between the inter- and intra-community link weights suggests that the partitioning that results from modularity maximization is meaningful.
The individual communities within the PR citation network have a wide range of structures, ranging from tightly knit to barely classifiable as a single entity. To illustrate this diversity, we again focus on the most prominent PR citation communities that contain 5 or more publications (Table 7). Applying modularity maximization to each community separately, we find modularity values that range from 0.16 to 0.50 for the communities that contain more than 25 publications.
