Monday, May 13, 2013

Shibata, N., Kajikawa, Y., Takeda, Y., and Matsushima, K. (2008). Detecting emerging research fronts based on topological measures in citation networks of scientific publications. Technovation, 28, 758-775.



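This study first treats the papers of a research domain as a network defined by their citations: each paper is a node, and a link is created between a paper and every paper it cites. It then applies the algorithm proposed by Newman (2004) to cluster the nodes by the topology of the network, so that citation links are dense within each resulting cluster and sparse between clusters. Because papers usually cite references on related research topics, a shared research topic can be identified from a group of papers that cite one another densely. Once the clusters that can be regarded as research topics are found, the study computes the average age of the papers in each cluster and the relationships among the clusters, and uses tf*idf to select the terms with the highest weights as labels for each cluster's topic. For the most frequently cited papers it then computes the within-cluster degree z and the participation coefficient P. The value of z is obtained by z-score normalizing the number of links between a paper's node and the other nodes in the same cluster; a large z means the paper has many citation links with the other papers in its cluster. P expresses how many clusters a paper's links reach; the larger P is, the more clusters the paper connects to. The study uses two domains, GaN and complex networks (CN), as case studies. The z and P values of the ten most cited papers in each domain show that in GaN both z and P are large, whereas in CN z is large but P is small. In other words, the highly cited GaN papers mostly link to other papers in their own cluster and also link to other clusters, while the CN papers mostly link only within their own cluster. The study therefore infers that GaN research is incremental innovation and CN research is branching innovation.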

information visualization

We divided citation networks into clusters using the topological clustering method, tracked the positions of papers in each cluster, and visualized citation networks with characteristic terms for each cluster. Analyzing the clustering results with the average age and parent–children relationship of each cluster may be helpful in detecting emergence. In addition, topological measures, within-cluster degree z and participation coefficient P, succeeded in determining whether there are emerging knowledge clusters.
There were at least two types of development of knowledge domains. One is incremental innovation as in GaN and the other is branching innovation as in complex networks. In the domains where incremental innovation occurs, papers changed their position to large z and large P. On the other hand, in the case of branching innovation, they moved to a position with large z and small P, because there is a new emerging cluster, and active research centers shift rapidly. Our results showed that topological measures are beneficial in detecting branching innovation in the citation network of scientific publications.
Massini et al. (2005) discussed the difference between pioneers (innovators) and adopters (imitators). For innovators and early adopters, it is essential to detect emerging research fields promptly before other competitors enter the research domain.
In fact, Sorenson and Fleming observed that patents that refer to scientific materials receive more citations (Sorenson and Fleming, 2004; Fleming and Sorenson, 2004). This partially supports the hypothesis that scientific publications play an important role in accelerating technological innovation.
Therefore, for both R&D managers in companies or research institutions and policy makers, noticing emerging research domains among numerous academic papers has become a significant task. However, such a task becomes highly laborious and difficult as each research domain becomes specialized and segmented.
There are two approaches to detecting emerging research domains and the topics discussed there (Kostoff and Schaller, 2001).
One straightforward manner is the expert-based approach, which utilizes the explicit knowledge of domain experts. However, it is often time-consuming and is also subjective in the current information-flooded era.
Another is the computer-based approach, which is compatible with the scale of information, and it is therefore expected to complement the expert-based approach. There is a commensurate increase in the need for scientific and technical intelligence to discover emerging research domains and the topics discussed there, even for unfamiliar domains (van Raan, 1996; Kostoff et al., 1997, 2001; Losiewicz et al., 2000; Boyack and Börner, 2003; Porter, 2005; Buter et al., 2006).
The temporal patterns of co-cited clusters are usually tracked to detect emerging fields with a variety of visualization techniques.
The multidimensional scaling (MDS) plot on a two-dimensional (2-D) plane is a typical example of such visualizations (Small, 1977). However, spatial configurations in MDS do not show links explicitly.
There are a number of efforts to improve the efficiency of visualization, such as the self-organizing map (SOM) (Skupin, 2004) and the pathfinder network (PFNET) (Chen, 1999, 2004). White et al. (2004) compared these two visualization techniques and noted that while PFNETs seem to be directive about relationships, SOMs are merely suggestive.
However, collecting papers with a keyword query causes two problems. One is the deficiency of relevant papers. It is not always true that a research domain can be represented by a single keyword. Another is the surplus of papers. In some cases, the same keyword is used in different research domains, which brings noisy papers into the corpus.
To overcome the first problem, we use broad queries to retain wide coverage of citation data. For the second problem, we analyze only the maximum component of the citation networks. By doing this step, non-relevant papers that do not cite papers in the corresponding research domain are removed.
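The maximum-component step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the citation data is a list of (citing, cited) pairs, ignores link direction when deciding connectivity, and uses a hypothetical function name and toy paper IDs.

```python
from collections import defaultdict, deque


def largest_component(citations):
    """Return the set of papers in the largest connected component.

    `citations` is an iterable of (citing, cited) pairs. Direction is
    ignored here (an assumption of this sketch).
    """
    neighbours = defaultdict(set)
    for citing, cited in citations:
        neighbours[citing].add(cited)
        neighbours[cited].add(citing)

    seen = set()
    best = set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects one component at a time.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical toy data: A-D cite one another, X-Y are an unrelated pair.
toy_citations = [("B", "A"), ("C", "A"), ("D", "C"), ("Y", "X")]
core = largest_component(toy_citations)
```

In the toy data, the stray pair X-Y shares a query hit with the others but no citation link, so it falls outside the retained component.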
After extracting the maximum component, we perform the topological clustering with Newman’s algorithm (Newman, 2004) in order to discover tightly knit clusters with a high density of within-cluster edges. With this process, citation networks are divided into clusters, within which papers densely cite each other.
In the last step, two topological measures, within-cluster degree, z_i, and participation coefficient, P_i, proposed by Guimera and Amaral (2005) are calculated in order to track the position of each paper in the clustered citation network.
As a result, we obtained the data of 15,134 papers on GaN and 7,370 papers on CN that had been published from 1970 to 2004.
Additionally, the analysis of intercitation is more straightforward than co-citation. Klavans and Boyack (2006) compared the similarity of the clustering results by intercitation to that by co-citation. They concluded that intercitation is more appropriate for the clustering of the similar documents. Intercitation also allows us to group papers that are only rarely cited, which is a significant portion of all papers (Hopcroft et al., 2004).
Amongst many clustering methods and algorithms, in this paper we apply a method proposed by Newman which is able to deal with large networks with relatively small calculation time in the order of O((m+n)n), or O(n^2) on a sparse network, with m edges and n nodes; therefore, this could be applied to large-scale networks (Newman, 2004).
The algorithm proposed is based on the idea of modularity. Q = Tr(e) - ||e^2||, ... The first part of the equation, Tr(e), represents the sum of density of edges within each cluster. A high value of this parameter means that nodes are densely connected within each cluster.... The second part of the equation, ||e^2||, represents the sum of density of edges within each cluster when all edges are placed randomly. ...  Q is the fraction of edges that fall within communities, minus the expected value of the same quantity if the edges fall at random without regard for the community structure.
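To make the definition concrete, here is a small sketch that evaluates Q for a given division. It assumes an undirected edge list and writes Q cluster by cluster as the sum of e_ss - a_s^2, where e_ss is the fraction of edges inside cluster s and a_s is the fraction of edge ends attached to s. Those symbol names follow Newman's notation rather than the excerpt, and the function name and toy graph are hypothetical.

```python
from collections import defaultdict


def modularity(edges, cluster_of):
    """Modularity Q of a division of an undirected network.

    `edges` is a list of (u, v) pairs; `cluster_of` maps every node
    to a cluster label.
    """
    m = len(edges)
    inside = defaultdict(int)  # edges with both ends in cluster s
    ends = defaultdict(int)    # edge ends attached to cluster s
    for u, v in edges:
        cu, cv = cluster_of[u], cluster_of[v]
        ends[cu] += 1
        ends[cv] += 1
        if cu == cv:
            inside[cu] += 1
    # e_ss = inside[s] / m, a_s = ends[s] / (2 * m)
    return sum(inside[s] / m - (ends[s] / (2 * m)) ** 2 for s in ends)


# Hypothetical toy graph: two triangles joined by a single edge c-d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
split = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2, "f": 2}
lumped = {node: 1 for node in split}

q_split = modularity(toy_edges, split)    # the two triangles as clusters
q_lumped = modularity(toy_edges, lumped)  # everything in one cluster
```

Splitting the toy graph into its two triangles gives Q = 5/14, while putting every node in one cluster gives Q = 0.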
A high value of Q represents a good community division where only dense edges remain within clusters and sparse edges between clusters are cut off, and Q = 0 means that a particular division gives no more within-community edges than would be expected by random chance.
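The clustering itself can be sketched as a naive greedy agglomeration in the spirit of Newman (2004): start with every paper in its own cluster and repeatedly merge the connected pair of clusters that raises Q the most, stopping when no merge raises it. The sketch below assumes an undirected edge list and uses the gain 2(e_ij - a_i a_j) from Newman's notation. It is not the authors' implementation, it makes no attempt at the efficient bookkeeping behind the quoted running time, and the function name and toy graph are hypothetical.

```python
def greedy_modularity_clusters(edges):
    """Greedy modularity agglomeration on an undirected edge list."""
    if not edges:
        return []
    half = 1.0 / (2 * len(edges))
    members = {}  # cluster id -> set of nodes
    a = {}        # cluster id -> fraction of edge ends attached to it
    e = {}        # frozenset({i, j}) -> half the fraction of edges i-j
    for u, v in edges:
        for n in (u, v):
            members.setdefault(n, {n})
            a[n] = a.get(n, 0.0) + half
        if u != v:
            pair = frozenset((u, v))
            e[pair] = e.get(pair, 0.0) + half

    while True:
        # Find the connected pair whose merge increases Q the most.
        best_gain, best_pair = 0.0, None
        for pair, e_ij in e.items():
            i, j = pair
            gain = 2 * (e_ij - a[i] * a[j])
            if gain > best_gain:
                best_gain, best_pair = gain, pair
        if best_pair is None:
            break  # no merge raises Q any further

        # Merge cluster j into cluster i and relabel j's links.
        i, j = best_pair
        members[i] |= members.pop(j)
        a[i] += a.pop(j)
        merged = {}
        for pair, e_ij in e.items():
            if pair == best_pair:
                continue  # these edges are now internal to i
            new_pair = frozenset(i if c == j else c for c in pair)
            merged[new_pair] = merged.get(new_pair, 0.0) + e_ij
        e = merged
    return list(members.values())


# Hypothetical toy graph: two triangles joined by a single edge c-d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]
clusters = greedy_modularity_clusters(toy_edges)
```

On the toy graph the merges stop with the two triangles as separate clusters, because joining them across the single bridge would lower Q.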
After dividing the papers into optimized clusters using Newman’s method, the role of each paper is determined by its within-cluster degree and its participation coefficient, which define how the node is positioned in its own cluster and between clusters (Guimera and Amaral, 2005). ... Within-cluster degree z_i measures how "well connected" node i is to other nodes in the cluster... Participation coefficient P_i measures how "well distributed" the edges of node i are among different clusters.
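The excerpt elides the formulas. The sketch below assumes the standard definitions from Guimera and Amaral (2005): z_i = (k_i - mean) / sd, where k_i is the number of links from node i to nodes in its own cluster and the mean and standard deviation are taken over that cluster, and P_i = 1 - sum over clusters s of (k_is / K_i)^2, where k_is is the number of links from i into cluster s and K_i is the total degree of i. Using the population standard deviation and setting z to 0 when a cluster has no spread are assumptions of this sketch, as are the function name and the toy graph.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_and_p(edges, cluster_of):
    """Return {node: (z, P)} for an undirected edge list.

    `cluster_of` must map every node that appears in `edges`
    to a cluster label.
    """
    # links[node][cluster] = number of links from node into that cluster
    links = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        links[u][cluster_of[v]] += 1
        links[v][cluster_of[u]] += 1

    within = {n: links[n].get(c, 0) for n, c in cluster_of.items()}
    by_cluster = defaultdict(list)
    for n, c in cluster_of.items():
        by_cluster[c].append(within[n])

    result = {}
    for n, c in cluster_of.items():
        mu = mean(by_cluster[c])
        sigma = pstdev(by_cluster[c])
        z = (within[n] - mu) / sigma if sigma > 0 else 0.0
        total = sum(links[n].values())
        p = 1.0 - sum((k / total) ** 2 for k in links[n].values()) if total else 0.0
        result[n] = (z, p)
    return result


# Hypothetical toy graph: h is cited by a-d in cluster 1 and also links to x
# in cluster 2.
toy_edges = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"),
             ("h", "x"), ("x", "y")]
toy_clusters = {"h": 1, "a": 1, "b": 1, "c": 1, "d": 1, "x": 2, "y": 2}
scores = z_and_p(toy_edges, toy_clusters)
z_h, p_h = scores["h"]
z_a, p_a = scores["a"]
```

In the toy graph, h has four of its five links inside its own cluster, so it gets a high z (2.0) and a moderate P (0.32), while a has a single within-cluster link and P = 0.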
According to the within-cluster degree, they classified nodes with z>=2.5 as hub nodes and nodes with z<2.5 as non-hub nodes.
In addition, non-hub nodes can be naturally divided into four different roles:
(R1) ultra-peripheral nodes, that is, nodes with most of their edges within their cluster (P<0.05);
(R2) peripheral nodes, that is, nodes with many edges within their cluster (0.05<P<=0.62);
(R3) non-hub connector nodes, that is, nodes with a high proportion of edges to other clusters (0.62<P<=0.80);
and (R4) non-hub kinless nodes, that is, nodes with edges homogeneously distributed among all clusters (P>0.80).
Similarly, hub nodes can be classified into three different roles:
(R5) provincial hubs, that is, hub nodes with the vast majority of edges within their cluster (P<0.30);
(R6) connector hubs, that is, hubs with many edges to the other clusters (0.30<P<=0.75);
and (R7) kinless hubs, that is, hubs with edges homogeneously distributed among other clusters (P>0.75).
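The seven roles amount to a lookup on (z, P), sketched below with the thresholds quoted above. The listed ranges leave P = 0.05 and P = 0.30 unassigned; this sketch places those boundary values in the lower role (R1 and R5), which is an assumption, and the function name is hypothetical.

```python
def node_role(z, p):
    """Map within-cluster degree z and participation coefficient P to a role."""
    if z >= 2.5:  # hub nodes
        if p <= 0.30:  # boundary value assigned to R5 (assumption)
            return "R5 provincial hub"
        if p <= 0.75:
            return "R6 connector hub"
        return "R7 kinless hub"
    # non-hub nodes
    if p <= 0.05:  # boundary value assigned to R1 (assumption)
        return "R1 ultra-peripheral node"
    if p <= 0.62:
        return "R2 peripheral node"
    if p <= 0.80:
        return "R3 non-hub connector node"
    return "R4 non-hub kinless node"


# Hypothetical (z, P) values for illustration.
examples = {
    "hub inside one cluster": node_role(3.1, 0.10),
    "hub spanning clusters": node_role(3.1, 0.50),
    "ordinary paper": node_role(0.4, 0.20),
}
```

Tracking how a highly cited paper's role changes from year to year is then a matter of recomputing z and P on each yearly snapshot and calling the function again.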
In GaN, where incremental innovation occurred, the top 10 papers changed position from (R2) peripheral nodes to (R6) connector hubs as the domain developed.
However, in CN, where branching innovation occurred, the top 10 papers moved from (R1) ultra-peripheral nodes to (R5) provincial hubs.
For both R&D managers in companies or research institutions and policy makers, there are two types of approaches, expert-based and computer-based, to noticing emerging research domains among numerous academic papers. However, the former approach becomes a highly laborious and difficult task as each research domain becomes specialized and segmented.
Our computer-based method, at least, complements this expert-based approach for the following three reasons.
First of all, experts’ judgment is not always right, especially in the current information-flood era. Sometimes, once-humble researchers accomplish great scientific achievements. Experts may fail to give credit to emerging trends.
Second, gathering experts is expensive. Identifying the quality of these papers before they become a new emerging cluster requires numerous experts.
Finally, our method is scalable. Even if the publication cycle becomes shorter and the number of publications grows, the computer-based approach could be effective.
Moreover, although previous research in knowledge mapping emphasized the method of visualization in order to detect emergence, our method enables us to detect it by monitoring variables such as z and P. When we use visualization, we must judge the emergence of a research cluster from the visualized map itself. Using quantitative variables such as z and P opens a way to detect it in a machine-friendly manner.
In domains where incremental innovation occurs, hub papers are connector hubs with large z and large P. On the other hand, in the case of branching innovation, there is a new emerging cluster and active research centers shift rapidly and hub papers become provincial hubs with large z and small P.
This means that in the case of GaN, hub papers have intercluster edges, which connect some clusters; however, in the case of CN, hubs connect mainly within their own clusters and have few intercluster edges.
In the detection of emerging research domains, the shortcoming of this approach is the existence of a time lag. It takes 1 or 2 years until a paper receives citations from other papers. It also takes 1 or 2 years from the completion of research to the publication of the research. Therefore, in the context of TIM and research policy, policy makers should complement this approach with unpublished information such as academic conferences and expert opinion.
