Wednesday, May 15, 2013

Quirin, A., Cordón, O., Vargas-Quesada, B., and Moya-Anegón, F. (2010). Graph-based data mining: A new tool for the analysis and comparison of scientific domains represented as scientograms. Journal of Informetrics, 4(3), 291-312.



network analysis
Using graph-based data mining, this study extracts the substructures common to a collection of scientograms in order to address three questions: (1) the year-by-year evolution of a country's scientific domain, (2) the extraction of the common research categories substructures across the countries of the world, and (3) the comparison of the scientific domains of different countries. For each of the 73 countries analyzed, a co-citation analysis of research categories is first performed for every year, producing a network in which each research category is a node and the links express the strength of co-citation between categories; these networks are the scientograms under study. The weight of each link is computed as CM(ij) = Cc(ij) + Cc(ij)/sqrt(c(i)*c(j)), where c(i) and c(j) are the citation counts of categories i and j and Cc(ij) is their co-citation count. Since Cc(ij)/sqrt(c(i)*c(j)) lies between 0 and 1, when two pairs of categories differ in co-citation count this measure gives the larger value to the pair with the larger count; when two pairs tie on co-citation count, the pair whose citation counts are closer to the co-citation count receives the larger value. Each network is then reduced with the pathfinder network algorithm (a dimensionality-reduction step), with the parameters set to r = ∞ and q = n − 1. Under these settings, a direct link between two nodes is judged not significant, and is deleted, whenever some other path between the two endpoints has a minimum link weight greater than the weight of the direct link. The networks can then be visualized with a layout algorithm such as Kamada-Kawai, which renders the nodes and their link relations as an image. To extract the substructures shared by the networks, the study uses the Subdue algorithm (Cook & Holder, 1994, 2000), whose operating principle is the minimum description length (MDL) (Rissanen, 1989). The nodes, links, and connectivity of a network are encoded as bit strings, and the total length of these strings is the network's description length. Let G be the set of networks, with total description length I(G), and let S be a substructure appearing in several of the networks, with description length I(S); when every occurrence of S is replaced by a single node, the description length of G becomes I(G|S). The MDL principle looks for the substructures that minimize I(S) + I(G|S); in practice the maximized quantity is valueMDLi(S,G) = I(G) / (I(S) + I(G|S)). Besides the minimum description length (that is, the largest valueMDLi), substructures of larger size and larger support are also considered more representative. Size counts the nodes and links contained in the networks and the substructure, evaluated as valuesize(S,G) = Size(G) / (Size(S) + Size(G|S)). Support counts how many networks contain the substructure, evaluated as valuesupport(S,G) = #graphs in G including S / card(G), where card(G) is the number of networks. When negative graphs are also considered, i.e., the substructure should appear in as few negative graphs as possible, the description-length measure becomes valueMDLi(S,Gp,Gn) = (I(Gp) + I(Gn)) / (I(S) + I(Gp|S) + I(Gn) − I(Gn|S)), where Gp and Gn denote the positive and the negative graph sets; the size measure becomes valuesize(S,Gp,Gn) = (Size(Gp) + Size(Gn)) / (Size(S) + Size(Gp|S) + Size(Gn) − Size(Gn|S)); and the support measure becomes valuesupport(S,Gp,Gn) = (#Gp graphs including S + #Gn graphs not including S) / (card(Gp) + card(Gn)). The experimental setups for the three questions are as follows:
1) Year-by-year evolution of a country's scientific domain
Working year by year, the networks of the year under analysis are taken as positive graphs and those of the preceding years as negative graphs, and the substructures significantly contained in the positive graphs are extracted.
2) Extraction of the common research categories substructures in the world
Working across countries, the networks of all countries are treated as positive graphs, and the substructures significantly contained in them are extracted.
3) Comparison of the scientific domains of different countries
Working across countries, the networks of the country under analysis are taken as positive graphs and those of the remaining countries as negative graphs, and the substructures significantly contained in the positive graphs are extracted.
In this paper, we aim to show that graph-based data mining tools are useful to deal with scientogram analysis. Subdue, the first algorithm proposed in the graph mining area, has been chosen for this purpose. This algorithm has been customized to deal with three different scientogram analysis tasks regarding the evolution of a scientific domain over time, the extraction of the common research categories substructures in the world, and the comparison of scientific domains between different countries.
The visualization of scientific information has long been used to uncover and divulge the essence and structure of science (Börner & Scharnhorst, 2009; Chen, 1999a, 2004).
Yet despite its ripe age, information display is still in an adolescent stage of evolution in the context of its application to scientific domain analysis.
There is a large number of information visualization techniques which have been developed over the last decade within this area (Chen, 1999b; Lucio-Arias & Leydesdorff, 2008; Moya-Anegón et al., 2007, 2005; Small & Garfield, 1985), but none of them has been designed to support the exploration of large datasets. Besides, all the latter approaches require a large amount of expertise from the user, which reduces the chances to automate the analysis procedure. Nevertheless, it is clear that information visualization and visual data mining (Keim, 2002) can provide the theoretical and practical backgrounds to deal with scientific information analysis.
The generation of a big picture is something implicit in the process of visualizing scientific information. In an attempt to sum up what has taken place to date, we can say that nowadays there are two proposals for tracking down the big picture. On the one hand, one can adopt the traditional units of analysis (authors, documents, and journals) and, through their grouping, identify scientific disciplines following a bottom-up process (Boyack & Klavans, 2008; Klavans & Boyack, 2006; Small & Sweeney, 1985; Small, Sweeney, & Greenlee, 1985). On the other hand, the alternative uses the categories of the documents to the same end, and shows the scientific structure from them in a top-down manner (Moya-Anegón et al., 2004).
The former proposal (bottom-up process) presents all the pros of its fine-grained character, but it runs into difficulties in representing the totality of the panorama on a single plane and in tagging the disciplines.
That is, it (top-down process) is relatively simple to represent the scientific structure of a domain on a single plane by means of a maximum of 300 categories and their interrelation, avoiding tagging problems. However, this implies the acceptance of a classification of science in predefined categories, never transparent and always subjective, as well as the fact that documents are classified by the journals in which they are published and not by their content (coarse-grained character).
Current scientogram analysis techniques (Boyack, Börner, & Klavans, 2009; Chen et al., 2009; Klavans & Boyack, 2006; Leydesdorff & Rafols, 2009; Moya-Anegón et al., 2007) aim to provide a fine, detailed, tight view of a scientogram. To do so, they are based on performing a low-level analysis and comparison of the maps. Statistical techniques, computer algorithms, and macrostructure and microstructure techniques for the identification of thematic areas and scientific disciplines have already been used to analyze and compare scientograms (Boyack, Klavans, & Börner, 2005; Chen, 1999b; Lucio-Arias & Leydesdorff, 2008; Moya-Anegón et al., 2007; Wallace, Gingras, & Duhon, 2009).
However, this approach shows a main limitation: only a single or a very reduced set of maps can be analyzed or compared together. In fact, the field lacks an easy-to-use approach allowing the identification and the comparison of scientific structures within scientograms with a higher degree of automation.
Graph-based data mining (GBDM) (Cook & Holder, 2006; Holder & Cook, 2005; Washio & Motoda, 2003) involves the automatic extraction of novel and useful knowledge from a graph representation of data. By ‘novel’ we mean that the knowledge retrieved is not directly encoded in the data but deeply masked in it (hence, it needs to be uncovered), and by ‘useful’ we mean that the discovered patterns are in general of interest to the domain expert ...
In fact, GBDM techniques have been applied for frequent substructure discovery and graph matching in a large number of domains including chemistry and applied biology (Borgelt & Berthold, 2002; Huan et al., 2004), classification of chemical compounds (Deshpande, Kuramochi, & Karypis, 2002), and unsupervised and supervised pattern learning (Cook & Holder, 2006), among many others.
Subdue (Cook & Holder, 1994, 2000) is a graph-based knowledge discovery system that finds structural, relational patterns in data representing entities and relationships. It aims to discover interesting and repetitive substructures in a structural database (DB). For this purpose, the minimum description length (MDL) principle (Rissanen, 1989) is used in order to compress the original DB into a hierarchical and shorter version.
In particular, we will describe how this algorithm can be customized to deal with three different scientogram analysis and comparison tasks regarding the evolution of a scientific domain over time, the extraction of the common research categories substructures in the world, and the comparison of scientific domains between different countries.
The generation of a scientogram following the top-down approach (Moya-Anegón et al., 2004) requires the sequential application of several techniques.
1. Units of analysis: The categories are the units of analysis and representation (Moya-Anegón et al., 2004; Vargas-Quesada & Moya-Anegón, 2007).
2. Unit of measure: ... a co-citation measure CM is computed for each pair of categories i and j as follows:
CM(ij) = Cc(ij) + Cc(ij)/sqrt(c(i)*c(j))
where Cc is the co-citation frequency and c is the citation frequency (a code sketch of steps 2-4 follows this list).
3. Dimensionality reduction: ...  the Pathfinder algorithm (Chen, 1998; Dearholt & Schvaneveldt, 1990) is applied to the co-citation matrix to prune the network. Due to the density of the data, and especially in the case of vast scientific domains with a high number of entities (categories in our case) in the network, Pathfinder is usually parameterized to r =∞ and q = n − 1. This is done in order to preserve and highlight the salient relationships between categories, and for capturing the essential underlying intellectual structure of a scientific domain.
4. Layout: The spring embedder family of methods is the most widely used in the area of Information Science. Spring embedders assign coordinates to the nodes in such a way that the final graph will be pleasing to the eye, and that the most important elements are located in the center of the representation (also called its backbone). Kamada–Kawai’s algorithm (Kamada & Kawai, 1989) is one of the most extended methods to perform this task. Starting from a circular position of the nodes, it generates networks with aesthetic criteria such as the maximum use of available space, the minimum number of crossed links, the forced separation of nodes, the build of balanced maps, etc.
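Steps 2-4 above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes numpy and networkx, uses toy count data, and reads PFNET(r = ∞, q = n − 1) in the usual way as a minimax-distance criterion, with the CM similarities inverted into distances.

```python
import numpy as np
import networkx as nx

def cocitation_measure(cc, c):
    """Step 2: CM(ij) = Cc(ij) + Cc(ij) / sqrt(c(i) * c(j))."""
    denom = np.sqrt(np.outer(c, c))
    norm = np.divide(cc, denom, out=np.zeros_like(cc, dtype=float),
                     where=denom > 0)
    return cc + norm

def pathfinder_prune(cm):
    """Step 3, PFNET(r = inf, q = n - 1): keep a link only if no other
    path between its endpoints has a smaller maximum distance."""
    n = cm.shape[0]
    with np.errstate(divide="ignore"):
        dist = np.where(cm > 0, 1.0 / cm, np.inf)   # similarity -> distance
    np.fill_diagonal(dist, 0.0)
    minimax = dist.copy()
    for k in range(n):  # Floyd-Warshall variant: max-min path closure
        minimax = np.minimum(minimax,
                             np.maximum(minimax[:, k][:, None],
                                        minimax[k, :][None, :]))
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if np.isfinite(dist[i, j]) and dist[i, j] <= minimax[i, j]:
                G.add_edge(i, j, weight=cm[i, j])
    return G

# Toy data: co-citation counts cc and citation counts c for 3 categories.
cc = np.array([[0, 5, 1], [5, 0, 2], [1, 2, 0]], dtype=float)
c = np.array([10.0, 8.0, 4.0])
scientogram = pathfinder_prune(cocitation_measure(cc, c))
pos = nx.kamada_kawai_layout(scientogram)   # step 4: spring-embedder layout
```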

The need for mining structural data to uncover objects or concepts that relate objects (i.e., subgraphs that represent associations of features) has increased in the past ten years, thus creating the area of GBDM (Holder & Cook, 2005; Washio & Motoda, 2003). Nowadays, GBDM has become a very active area and several techniques such as Subdue, the Apriori family of methods (Apriori-based GM (Inokuchi, Washio, & Motoda, 2000), Frequent Subgraph Discovery (Kuramochi & Karypis, 2001), JoinPath (Vanetik, Gudes, & Shimony, 2002), etc.), and the Frequent Pattern-growth family of methods (CloseGraph (Yan & Han, 2003), FFSM (Huan, Wang, & Prins, 2003), Gaston (Nijssen & Kok, 2004), gSpan (Yan & Han, 2002), MoFa/MoSS (Borgelt & Berthold, 2002), Spin (Huan, Wang, Prins, & Yang, 2004), etc.) have been proposed to deal with problems such as graph matching, graph visualization, frequent substructure discovery, conceptual clustering, and unsupervised and supervised pattern learning (Cook & Holder, 2006).
Among them, we can highlight Subdue (Cook & Holder, 1994, 2000), a graph-based knowledge discovery system that finds structural, relational patterns in data representing entities and relationships. ... It is able to develop graph shrinking as well as frequent substructure extraction and hierarchical conceptual clustering.
Subdue (Cook & Holder, 1994, 2000) is a method for discovering interesting and repetitive substructures in a structural DB. The algorithm uses the MDL principle (Rissanen, 1989) to discover frequent substructures in a DB, extract them and replace them by a single node in order to compress the DB. These extracted substructures represent structural concepts in the data.
The Subdue algorithm can be run several times in a sequence in order to extract meta-concepts from the previously simplified DB. After multiple Subdue runs on the DB, we can discover a hierarchical description of the structural regularities in the data (Jonyer, Cook, & Holder, 2001).
Subdue can also use background knowledge, such as domain-oriented expert knowledge, to be guided and to discover substructures for a particular domain goal.
Subdue uses a variant of beam search (Lowerre, 1976) in order to avoid an exponentially sized queue: at each step, only BeamWidth new children from a given parent are explored (line 14 of the paper's algorithm listing). Furthermore, at most MaxBest substructures having a maximal size of MaxSubSize are returned to the user, and the algorithm does not run more than Limit iterations (line 6). These parameters ensure that the running time of Subdue is polynomial, constrained in practice by the BeamWidth and Limit parameters (Jonyer et al., 2001).
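The control loop described here can be written down as a generic beam search. The following skeleton is schematic rather than Subdue's actual code; expand and score are hypothetical callables standing in for substructure extension and the evaluation measures discussed next, and MaxSubSize filtering would live inside expand.

```python
def beam_search(seeds, expand, score, beam_width, limit, max_best):
    """Subdue-style control loop: each parent contributes at most
    beam_width children per iteration (the BeamWidth cut), at most
    limit iterations run, and at most max_best results are returned."""
    parents, best = list(seeds), []
    for _ in range(limit):
        children = []
        for parent in parents:
            kids = sorted(expand(parent), key=score, reverse=True)
            children.extend(kids[:beam_width])
        if not children:
            break
        best = sorted(best + children, key=score, reverse=True)[:max_best]
        parents = sorted(children, key=score, reverse=True)[:beam_width]
    return best
```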
The evaluation of a substructure (see line 13) can be computed by the MDL measure (see Section 3.1.1), the Size measure (see Section 3.1.2), or the Support measure (see Section 3.1.3).
The MDL of a graph is the necessary number of bits for describing completely the graph. This number of bits is usually given by the value I(S), the number of bits required to encode the substructure S. I(S) is computed as the sum of the number of bits to encode the vertices of S, the number of bits to encode the edges of S, and the number of bits to encode the adjacency matrix describing the graph connectivity of S. Subdue looks for the substructure S minimizing I(S) + I(G|S), where G is the input graph, I(S) is the number of bits required to encode the uncovered substructure, and I(G|S) is the number of bits required to encode the graph obtained by compressing G with S, i.e., substituting each occurrence of S in G by a single node (Holder, Cook, & Djoko, 1994).
In the following, the measure is renamed MDLi (’i’ stands for inverse) as its value is maximized: Subdue considers a given substructure S1 better than another substructure S2 if the MDLi measure valueMDLi(S1,G) is higher than valueMDLi(S2,G), where valueMDLi is computed as follows: valueMDLi(S,G) = I(G) / (I(S) + I(G|S))
However, the alternative operation mode for Subdue considers two distinct sets, a positive set Gp and a negative set Gn, determined by the user. In this operation mode, the goal of Subdue is to find the largest substructures present in the maximum number of graphs in the positive set, which are not included in the negative set. The MDLi measure is thus computed as follows:
valueMDLi(S,Gp,Gn) = (I(Gp) + I(Gn)) / (I(S) + I(Gp|S) + I(Gn) − I(Gn|S))
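Both MDLi variants translate directly into code once a description-length function I and a compression routine are available; the sketch below assumes both as hypothetical callables.

```python
def value_mdli(I, compress, S, G):
    """valueMDLi(S, G) = I(G) / (I(S) + I(G|S)); compress(G, S) replaces
    every occurrence of S in G by a single node."""
    return I(G) / (I(S) + I(compress(G, S)))

def value_mdli_pn(I, compress, S, Gp, Gn):
    """Two-set variant: reward substructures that compress the positive
    set Gp while leaving the negative set Gn largely untouched."""
    return (I(Gp) + I(Gn)) / (
        I(S) + I(compress(Gp, S)) + I(Gn) - I(compress(Gn, S)))
```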
The size of an object is not computed from the description length, but from an index based on either the number of nodes, the number of edges or, more usually, the sum of both values. This measure is faster to compute but less consistent, as it does not show the real benefit obtained after the compression of the DB.
valuesize(S,G) = Size(G) / (Size(S) + Size(G|S))  where, usually, Size(G) = #vertices(G) + #edges(G).
In the case of the second operation mode, in which we have a positive and a negative scientogram set, the Size measure is computed as follows:
valuesize(S,Gp,Gn) = (Size(Gp) + Size(Gn)) / (Size(S) + Size(Gp|S) + Size(Gn) − Size(Gn|S))
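A minimal rendering of the Size measure, assuming networkx-style graphs; the two-set variant follows the same substitution pattern as the MDLi sketch above.

```python
def size(G):
    """Size(G) = #vertices(G) + #edges(G) for a networkx graph."""
    return G.number_of_nodes() + G.number_of_edges()

def value_size(compress, S, G):
    """valuesize(S, G) = Size(G) / (Size(S) + Size(G|S))."""
    return size(G) / (size(S) + size(compress(G, S)))
```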
The last alternative measure is based on the support of substructure S and it is expressed as follows:
valuesupport (S,G) = #graphs in G including S / card(G), with card(G) being the cardinality of the set of graphs G composing the DB.
For the second operation mode, this evaluation measure is computed as the sum of the number of positive maps containing S and the number of negative maps not containing S, divided by the total number of maps. Its formulation is as follows:
valuesupport (S,Gp,Gn) = (#Gp graphs including S + #Gn graphs not including S) /  (card(Gp) + card(Gn))
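The Support measures are simple counts; contains(g, S) below is a hypothetical subgraph test (a subgraph-isomorphism check in practice).

```python
def value_support(contains, S, graphs):
    """Fraction of the DB graphs that include substructure S."""
    return sum(contains(g, S) for g in graphs) / len(graphs)

def value_support_pn(contains, S, positives, negatives):
    """Two-set variant: positive graphs should include S, negatives not."""
    hits = sum(contains(g, S) for g in positives)
    hits += sum(not contains(g, S) for g in negatives)
    return hits / (len(positives) + len(negatives))
```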

We consider a substructure having a larger positive support and a smaller negative support as having a better quality. In the same way, substructures having a larger size are preferred over smaller ones as they are more specific.

Study of the evolution of the scientific domain of a specific country over time
An information science expert would be interested in knowing which substructures appear in the analyzed domain, at which time, how big they are, how many there are, where they are located, and so forth. This allows at least two kinds of studies. On the one hand, an in-depth analysis of the uncovered substructures themselves, which kinds of categories they link, etc. On the other hand, global statistics about the size and the quantity of these substructures, to characterize respectively the importance of the evolution of the domain and its dynamics.
Thus, the goal of the first analysis task is to present a framework for the study of the evolution of a scientific domain over time using Subdue. ... As we want to look for common research categories substructures (CRCSs) appearing at a given time, we also need to pick two ranges of years, the negative range and the positive range. The negative range is usually a set of years from the past, in which these substructures (i.e., CRCSs) are not meant to exist. The positive range is usually a set of years dated after the negative range, in which the substructures are meant to be present.

Identification of the common research categories substructures in the world
The aim of the second scientogram analysis task is to uncover the CRCSs in the world by analyzing the scientograms of a large number of different countries. ... All the selected maps representing the scientific production of those countries for that given year will be viewed as positive examples, so the goal of Subdue will be to extract the substructures with the best support among all of them. Notice that no negative examples are considered in this case. As the user will be especially interested in the extracted CRCSs being as specific as possible, the MDLi measure will again be considered, to extract both frequent and large substructures.

Comparison of the scientific domains of different countries
To do so, the scientogram of each country in a given set is compared against the remaining ones in that set, the current country viewed as a positive map and the others as negative maps. ... Note that this experiment could also be done using time periods larger than a single year, or more than one country in the positive set each time, thus allowing an expert to extract the substructures highlighting the possible similarities between these countries.

Tuesday, May 14, 2013

White, H. D. (2003). Pathfinder networks and author cocitation analysis: A remapping of paradigmatic information scientists. Journal of the American Society for Information Science and Technology, 54, 423-434.


This study argues that when researchers cite the literature, the author names they mention carry two kinds of meaning: one is the topics addressed by an author's oeuvre, and the other is the scholarly specialties, schools of thought, or communities of the mind the author is perceived to belong to, sometimes even his or her network of acquaintances. To the extent that ties of acquaintanceship or intercommunication are found among co-cited authors, large-scale author cocitation analysis (ACA) can map something approximating invisible colleges. Traditional ACA uses statistical techniques such as Pearson correlation coefficients and multidimensional scaling (MDS) to map authors onto a two-dimensional plane according to the similarity of their co-citation profiles: each author becomes a point on the map, with authors who are co-cited in similar ways placed close together and dissimilar ones far apart. The resulting maps consist only of scattered points; they suit insiders who already know the field and can recognize or re-examine it, while outsiders find little meaning in them. Indeed, because such maps are so simplified, they cannot express the complexity of the paradigms formed by the relations among authors, and even insiders may not accept what the maps present. This study instead builds the network from raw cocitation counts, prunes the less important links with the pathfinder network (PFNET) technique, and draws the map with the Kamada-Kawai algorithm, in order to reveal the research topics of a field and the important researchers associated with them. On the PFNET-pruned map, authors with high degree centrality, that is, the points with relatively many links, are regarded as the dominant authors. The dense patterns formed by a dominant author and the authors linked to him define a specialty, and the links among dominant authors connect the topics of the specialties into a whole discipline. Reanalyzing the same author co-citation data as White & McCain (1998) on the specialties and major authors of information science, the study finds that Salton, Garfield, Lancaster, and Price score highest on all centrality measures and are the dominant authors of the field, and that the patterns on the map depict the paradigm of information science for 1972-1995.
information visualization
In PFNETs, nodes represent authors, and explicit links represent weighted paths between nodes, the weights in this case being cocitation counts. The links can be drawn to exclude all but the single highest counts for author pairs, which reduces a network of authors to only the most salient relationships. When these are mapped, dominant authors can be defined as those with relatively many links to other authors (i.e., high degree centrality).
Links between authors and dominant authors define specialties, and links between dominant authors connect specialties into a discipline.
White and McCain’s raw data from 1998 are remapped as a PFNET. It is shown that the specialty groupings correspond closely to those seen in the factor analysis of the 1998 article.
During the past 20 years, several map making techniques have been tried in ACA. Raw cocitation counts and Pearson r correlations of author pairs have both been used as input; output displays have included multidimensional scaling, complete-linkage clustering, factor-loading plots, Kohonen self-organizing maps, geographic-style maps, and Pathfinder networks (PFNETs).
Afterward, the interest of the name combinations is dual.
As designators of oeuvres, names jointly connote intertextual themes—lines of exposition and perhaps controversy.
As designators of people, names jointly connote scientific or scholarly specialties, schools of thought, communities of the mind, and—sometimes— networks of acquaintances. To the extent that ties of acquaintanceship and intercommunication are found among cocitees (which is often), large-scale ACA maps approximate invisible colleges.
Even so, ACA maps have obvious limitations. To be useful, they must depict a domain the viewer already knows, or at least is curious about; names that fascinate insiders will bore outsiders. Furthermore, the maps will not—cannot— capture all of the relations among authors that give a paradigm its complexity. The whole point of ACA mapping is to simplify. But in simplifying relationships to those most salient in the database, ACA may contradict how a field is viewed in individual heads.
Pearson r detects the similarity of count profiles across all authors, and it was chosen over single-highest counts because the latter can vary across three orders of magnitude, which results in high-end pairs overwhelming low-end pairs in multidimensional scaling (cf. McCain, 1990). However, the computation of Pearson rs adds another layer of complexity to getting the maps out. PFNETs at r = ∞ remove this layer because they are not affected by absolute magnitudes of the counts, only by whether the counts are higher or lower when algorithmically compared.
Raw-count PFNETs are actually more informative than those made with Pearson rs, because when many authors share their highest counts with a single dominant author, specialty or subspecialty structure emerges automatically, and there is no need for a separate clustering routine.
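The invariance claim can be made concrete: Pearson r requires a full profile-correlation pass over the matrix, while a PFNET at r = ∞ depends only on the rank order of the counts, so any strictly increasing rescaling leaves the pruned network unchanged. A toy sketch with synthetic counts (stable argsort so ties break identically):

```python
import numpy as np

rng = np.random.default_rng(0)
cc = rng.integers(0, 50, size=(120, 120))       # toy raw cocitation counts
cc = np.triu(cc, 1) + np.triu(cc, 1).T          # symmetric, zero diagonal

r = np.corrcoef(cc)   # the extra "layer": correlations of count profiles

# PFNET at r = inf compares counts only ordinally: a strictly monotone
# transformation (here log1p) preserves the full rank order of the counts.
order = np.argsort(cc, axis=None, kind="stable")
assert (order == np.argsort(np.log1p(cc), axis=None, kind="stable")).all()
```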
The citation counts of several hundred contributors to information science during the 24-year period 1972–1995 were obtained in early 1996 (White & McCain, 1998). The top 120 names from this list were systematically paired and their raw cocitation counts taken from ISI’s Social Scisearch (the on-line Social Sciences Citation Index) on Dialog. Those counts are reused here.
The resizing shows four authors dominating information science in the 1972–1995 period—Gerard Salton, Eugene Garfield, F. W. Lancaster, and, to a lesser extent, Derek Price. The same four authors are also highest on two other actor centrality measures available in Pajek and UCINet, closeness and betweenness. Closeness is the inverse of “farness,” which is the sum of all shortest paths (geodesics) from any author to any other author in the network. Betweenness counts the number of geodesics on which any node lies.
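The three centralities mentioned here map directly onto networkx calls; a sketch with a stand-in graph (the real input would be the 120-author raw-count PFNET):

```python
import networkx as nx

pfnet = nx.karate_club_graph()   # stand-in for the raw-count PFNET

degree = nx.degree_centrality(pfnet)
closeness = nx.closeness_centrality(pfnet)      # inverse of "farness"
betweenness = nx.betweenness_centrality(pfnet)  # geodesics through a node

dominant = sorted(degree, key=degree.get, reverse=True)[:4]
```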
One attractive feature of raw-count PFNETs is that they not only form specialties around dominant authors but also chain the specialties in explicit sequences. The ordering of nodes in these sequences is non-arbitrary, and reveals how major topical areas in a field are connected.
If one traverses the most highly connected nodes from right to left in Figure 1, this sequence suggests itself: Markey -> Bates -> Belkin -> Saracevic -> Salton -> Lancaster -> Garfield -> Price -> Brookes. These and their associated authors represent paradigmatic information science of the 1972–1995 period (cf. Ding et al., 1999; Persson, 1994; Urs, 1995).
The authors from Markey to Saracevic share a focus on non-experimental document retrieval systems (e.g., on-line bibliographic databases, on-line library catalogs) and their users.
Salton and Garfield dominate two large central groups that I have elsewhere called, respectively, the retrievalists and the citationists. The retrievalists generally bring high formal and computational skills to problems of designing and evaluating experimental systems for document retrieval. The citationists analyze properties of the scientific and scholarly literatures from which documents are retrieved, especially the citation linkages that became amenable to study after Garfield founded the Institute for Scientific Information and its databases.
According to the closeness and betweenness measures in Table 1, he (Lancaster) is the most central figure in the map. He is also at the center of a group of generalists, many of them active for decades in the movement to automate various information services. The generalists are more oriented toward existing library and bibliographic institutions than the retrievalists, and are perhaps inclined to a more encyclopedic range of interests, including information policy issues.
The group around Garfield has further links leftward to authors who also analyze literatures, often in the context of scientific communication studies in general—various citationists, bibliometricians, and scientometricians centered on Price and B.C. Brookes. Brookes and his group represent mathematical bibliometrics.
Simon and his neighbors are used in conceptualizations of the nature of information studies; he and Zipf are also cited in bibliometrics. The Price -> Merton -> Crane line that ends in Rice is the intersection of science and technology studies with information science (both fields, for example, have used the idea of “invisible colleges”); Steven and Jonathan Cole, Harriet Zuckerman, and Thomas S. Kuhn symbolize this area as well. Most of these social scientists have contributed to literature-based domain analysis (e.g., Kuhn stated that the history of paradigms might be tracked through citations) and fit comfortably on Garfield’s side of the map.
One way in which the PFNET in Figure 1 does differ from the multidimensional scaling (MDS) maps in White and McCain (1998) is in its rendering of “disciplinary centrality.”
PCAs in both the 1998 article and White and Griffith (1982) demonstrated that paradigms comprise “crystallized” authors, who load on a single factor, and “diffuse” or “pervasive” authors, who load less strongly on two or more factors.
The PFNET in Figure 3 (with Pearson r correlations) simply chains together the highest rs for particular author pairs. This fails to render specialties in such a way that even a complete outsider can see them, as did Figure 1 (with raw cocitation counts). ...  The Figure 3 PFNET also removes all sense of the field’s most important authors.
Pearson r correlations, and the tables and displays based on them, will retain their usefulness in certain kinds of ACA, but they do not make for the best PFNETs.
Pathfinder network analysis comes out of semantic association studies in cognitive psychology (Schvaneveldt, 1990), but it shares a foundation in mathematical graph theory with social network analysis, an active subfield of sociology and anthropology (Wasserman & Faust, 1994). A collateral benefit of working with PFNETs is that it bonds ACA to cognitive semantics and to social network analysis.
More important, the move to PFNETs makes explicit what has been true all along—that ACA is a kind of network analysis: authors as people form social networks; authors as oeuvres are formed by citers into semantically rich citation networks.

Zhao, H. and Lin, X. (2010). A comparison of mapping algorithms for author co-citation data analysis. In Proceedings of the American Society for Information Science and Technology 2010, pp. 13, October 22–27, 2010, Pittsburgh, PA, USA.



This paper compares the application of four mapping algorithms to author co-citation analysis: MDS with agglomerative hierarchical clustering, PFNet, SOM, and Blondel community detection. The dataset is the co-citation counts of the 100 most highly cited authors in information science during 1999-2008. The mapping results of all four methods clearly reveal author clusters for three topics, information retrieval, user studies, and bibliometrics; the self-organizing map additionally reveals a group of authors doing basic theoretical research, and Blondel community detection detects two further clusters, human-computer interaction and social informatics. Since all four methods recover the same information retrieval, user studies, and bibliometrics clusters, the authors take author co-citation analysis to be demonstrably valid. Among the four methods, MDS with agglomerative hierarchical clustering and Blondel community detection represent the global division of the field clearly, while PFNet and SOM give a finer-grained description of the field through the relative positions on the map; cluster membership can be read directly from the layouts of MDS with agglomerative hierarchical clustering and of PFNet, whereas SOM lacks this capability, and Blondel community detection, having many more links, easily looks cluttered.

In this study, we selected and applied four of the mapping methods to the same dataset, the author co-citation matrix of the top 100 highly cited information scientists.

Dataset used in this paper is a 100 by 100 author co-citation matrix, of which the rows and columns are the top 100 highly cited authors in Library and Information Science (LIS) during the 1999 to 2008 period.

We applied here four algorithms in the mapping process of our dataset: (1) Multidimensional Scaling with Agglomerative Hierarchical Clustering; (2) Pathfinder Networks; (3) Kohonen Map; and (4) Blondel Community Detection Algorithm.

In this method, multidimensional scaling is used for ordination and agglomerative hierarchical clustering for grouping authors. We use Pearson r as the measure of similarity between authors. The 100 by 100 co-citation matrix is converted to Pearson r correlation matrix, before being submitted to multidimensional scaling and agglomerative hierarchical clustering procedures.
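A minimal sketch of this pipeline with scipy and scikit-learn on toy data; the paper does not name its linkage criterion, so complete linkage is an assumption here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
cc = rng.random((100, 100))
cc = (cc + cc.T) / 2                  # toy symmetric co-citation matrix

r = np.corrcoef(cc)                   # Pearson r of co-citation profiles
d = 1.0 - r                           # similarity -> dissimilarity
np.fill_diagonal(d, 0.0)

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(d)          # 2-D ordination
Z = linkage(squareform(d, checks=False), method="complete")
groups = fcluster(Z, t=4, criterion="maxclust")        # e.g. four clusters
```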

Figure 1 shows that there are four distinct clusters identified. We label them as Bibliometrics I, represented by ROUSSEAU R, EGGHE L etc., Bibliometrics II (citation) by WHITE HD, MCCAIN KW etc., Information Retrieval by SALTON G, JONES KS, etc., and User Study by BELKIN NJ, BATES MJ, etc.

Pathfinder Networks algorithm approaches the ACA mapping problem as a graph pruning problem. With nodes representing authors, weighted links representing their cocitation counts, the goal is to discard insignificant links while preserving the salient semantic connection patterns in the original network (Schvaneveldt, 1990).
The result (Figure 2) shows that there are three major clusters identified, with GARFIELD E centered the Bibliometrics cluster, SALTON G the Information Retrieval cluster, and BELKIN NJ the Information Behavior cluster.
Kohonen Map algorithm is an unsupervised learning algorithm in the family of artificial neural networks (Kohonen, 2000). It learns the underlying structure of the original high dimensional inputs in a recursive process and presents the results as rectangle regions.
Figure 3 shows our Kohonen Map for the 100 authors. Several distinct regions are labeled, including User Study represented by BATES MJ, KUHLTHAU CC, etc., Information Retrieval by SALTON G, CROFT WB, etc., and Bibliometrics by GARFIELD E, SMALL H, etc. An interesting group shown explicitly on this map is the Theorist, including WILSON P, BUCKLAND MK, BUDD JM, etc.
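A hedged sketch of such a map with the third-party minisom package; the grid size, training length, and toy profiles below are illustrative choices, not the paper's settings.

```python
import numpy as np
from minisom import MiniSom   # third-party: pip install minisom

rng = np.random.default_rng(0)
profiles = rng.random((100, 100))     # toy author co-citation profiles

som = MiniSom(8, 8, profiles.shape[1], sigma=1.0, learning_rate=0.5,
              random_seed=0)
som.train_random(profiles, 5000)      # recursive, unsupervised fitting
cells = {i: som.winner(profiles[i]) for i in range(len(profiles))}
# Authors landing in the same or adjacent grid cells form the regions
# that get labeled (User Study, Information Retrieval, ...).
```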
Community Detection Methods treat the mapping problem as a graph division problem (Newman & Girvan, 2004). We apply on the 100 by 100 co-citation matrix the Blondel community detection algorithm introduced in Wallace, Gingras, & Duhon (2009). Implementation of this algorithm is based on the Network Workbench (NWB Team, 2006).
Five communities of different sizes are identified. A visualization using the Circular Hierarchy layout is shown in Figure 4. In addition to the three major clusters, Information Retrieval, Bibliometrics and User Study, which are identified by the other clustering methods, another two distinct clusters, Human Computer Interaction and Social Informatics, are detected using this method.
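Recent networkx versions (2.8+) ship a Louvain implementation of the Blondel algorithm; the study itself used the Network Workbench, so the following is only an illustrative stand-in on toy data.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)
cc = rng.integers(0, 20, (100, 100))
cc = np.triu(cc, 1) + np.triu(cc, 1).T   # toy symmetric count matrix
G = nx.from_numpy_array(cc)              # weighted co-citation graph

communities = nx.community.louvain_communities(G, weight="weight", seed=0)
sizes = sorted((len(c) for c in communities), reverse=True)
```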
... different algorithms reveal the structure of LIS in different manners: MDS with AHC and Blondel Community Detection give clear global division of the field, while PFNET and Kohonen Map preserve much finer granularity descriptions in terms of the relative positioning of LIS authors.
Among all the mapping layouts, PFNET and MDS are the easiest to comprehend, because grouping and membership information can be readily derived from their layouts. While the Kohonen Map presents richer information about local proximity among authors, it fails to show membership information at a larger scale. The community detection algorithm, because it does not do any edge pruning, generates a cluttered mapping result.

Monday, May 13, 2013

Shibata, N., Kajikawa, Y., Takeda, Y., and Matsushima, K. (2008). Detecting emerging research fronts based on topological measures in citation networks of scientific publications. Technovation, 28, 758-775.



This study first treats the papers related to a field, together with the citations among them, as a network: each paper is a node, and links are created from a paper to the papers it cites. Newman's (2004) algorithm is then used to cluster the nodes by the topology of the network, so that citation links are dense within the resulting clusters and sparse between them. Since papers usually cite references on topics related to their own research, a group of papers that densely cite one another can be taken to share a research topic. After identifying the clusters that can be regarded as research topics, the study computes the average age of the papers in each cluster and the relations among clusters, labels each cluster with its highest-weighted tf*idf terms, and then computes, for the most-cited papers, the within-cluster degree z and the participation coefficient P. z is the z-score-normalized count of links between a paper's node and the other nodes of the same cluster, so a large z means the paper has many citation relations with other papers in its cluster; P reflects how many clusters a paper's links reach, with a larger P indicating links to more clusters. Taking GaN and complex networks (CN) as the two case domains, the z and P values of the ten most-cited papers show that in GaN both z and P are large, while in CN z is large but P is small: the highly cited GaN papers link not only to other papers within their cluster but also to other clusters, whereas the CN papers mostly link only within their own cluster. The study therefore infers that GaN research represents incremental innovation and CN branching innovation.

information visualization

We divided citation networks into clusters using the topological clustering method, tracked the positions of papers in each cluster, and visualized citation networks with characteristic terms for each cluster. Analyzing the clustering results with the average age and parent–children relationship of each cluster may be helpful in detecting emergence. In addition, topological measures, within-cluster degree z and participation coefficient P, succeeded in determining whether there are emerging knowledge clusters.
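The labeling step (characteristic terms per cluster via tf*idf) can be sketched with scikit-learn; the cluster texts below are hypothetical stand-ins for the concatenated titles and abstracts of each cluster's papers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def characteristic_terms(cluster_texts, top_k=5):
    """One document per cluster; returns the top_k tf-idf terms of each."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(cluster_texts)
    terms = vec.get_feature_names_out()
    return [[terms[j] for j in row.toarray().ravel().argsort()[::-1][:top_k]]
            for row in X]

labels = characteristic_terms([
    "gallium nitride led blue laser diode growth",          # toy cluster 1
    "scale free network degree distribution small world",   # toy cluster 2
])
```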
There were at least two types of development of knowledge domains. One is incremental innovation as in GaN and the other is branching innovation as in complex networks. In the domains where incremental innovation occurs, papers changed their position to large z and large P. On the other hand, in the case of branching innovation, they moved to a position with large z and small P, because there is a new emerging cluster, and active research centers shift rapidly. Our results showed that topological measures are beneficial in detecting branching innovation in the citation network of scientific publications.
Massini et al. (2005) discussed the difference between pioneers (innovators) and adopters (imitators). For innovators and early adopters, it is essential to detect emerging research fields promptly before other competitors enter the research domain.
In fact, Sorenson and Fleming observed that patents that refer to scientific materials receive more citations (Sorenson and Fleming, 2004; Fleming and Sorenson, 2004). This partially supports the hypothesis that scientific publications play an important role in accelerating technological innovation.
Therefore, for both R&D managers in companies or research institutions and policy makers, noticing emerging research domains among numerous academic papers has become a significant task. However, such a task becomes highly laborious and difficult as each research domain becomes specialized and segmented.
There are two approaches to detecting emerging research domains and the topics discussed there (Kostoff and Schaller, 2001).
One straightforward manner is the expert-based approach, which utilizes the explicit knowledge of domain experts. However, it is often time-consuming and is also subjective in the current information-flooded era.
Another is the computer-based approach, which is compatible with the scale of information, and it is therefore expected to complement the expert-based approach. There is a commensurate increase in the need for scientific and technical intelligence to discover emerging research domains and the topics discussed there, even for unfamiliar domains (van Raan, 1996; Kostoff et al., 1997, 2001; Losiewicz et al., 2000; Boyack and Börner, 2003; Porter, 2005; Buter et al., 2006).
The temporal patterns of co-cited clusters are usually tracked to detect emerging fields with a variety of visualization techniques.
The multidimensional scaling (MDS) plot on a two-dimensional (2-D) plane is a typical example of such visualizations (Small, 1977). However, spatial configurations in MDS do not show links explicitly.
There are a number of efforts to improve the efficiency of visualization, such as the self-organizing map (SOM) (Skupin, 2004) and the pathfinder network (PFNET) (Chen, 1999, 2004). White et al. (2004) compared these two visualization techniques and noted that while PFNETs seem to be directive about relationships, SOMs are merely suggestive.
However, retrieving the corpus by keyword causes two problems. One is the deficiency of relevant papers: it is not always true that a research domain can be represented by a single keyword. The other is the surplus of papers: in some cases, the same keyword is used in different research domains, which introduces noisy papers into the corpus.
To overcome the first problem, we use broad queries to retain wide coverage of citation data. For the second problem, we analyze only the maximum component of the citation networks. By doing this step, non-relevant papers that do not cite papers in the corresponding research domain are removed.
After extracting the maximum component, we perform the topological clustering, in order to discover tightly knit clusters with a high density of within-cluster edges, using Newman’s algorithm (Newman, 2004). With this process, citation networks are divided into clusters within which papers densely cite each other.
In the last step, two topological measures, the within-cluster degree z_i and the participation coefficient P_i, proposed by Guimera and Amaral (2005), are calculated in order to track the position of each paper in the clustered citation network.
As a result, we obtained the data of 15,134 papers on GaN and 7370 papers on CN that had been published from 1970 to 2004.
Additionally, the analysis of intercitation is more straightforward than co-citation. Klavans and Boyack (2006) compared the similarity of the clustering results by intercitation to that by co-citation. They concluded that intercitation is more appropriate for the clustering of the similar documents. Intercitation also allows us to group papers that are only rarely cited, which is a significant portion of all papers (Hopcroft et al., 2004).
Amongst many clustering methods and algorithms, this paper applies a method proposed by Newman which is able to deal with large networks in a relatively small computation time on the order of O((m+n)n), or O(n^2) on a sparse network, with m edges and n nodes; it can therefore be applied to large-scale networks (Newman, 2004).
The algorithm proposed is based on the idea of modularity, Q = Tr(e) - ||e^2||, ... The first part of the equation, Tr(e), represents the sum of the densities of edges within each cluster. A high value of this term means that nodes are densely connected within each cluster.... The second part, ||e^2||, represents the sum of the densities of edges within each cluster when all edges are placed randomly. ... Q is the fraction of edges that fall within communities, minus the expected value of the same quantity if the edges fall at random without regard for the community structure.
A high value of Q represents a good community division where only dense-edged remain within clusters and sparse edges between clusters are cut off, and Q = 0 means that a particular division gives no more within-community edges than would be expected by random chance.
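A direct transcription of this definition of Q, assuming a simple undirected graph given as a 0/1 adjacency matrix; two triangles joined by a single edge give a clearly positive modularity.

```python
import numpy as np

def modularity(adj, labels):
    """Q = Tr(e) - ||e^2||, where e[r][s] is the fraction of edges joining
    cluster r to cluster s and ||x|| sums the entries of the matrix x."""
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    idx = {g: i for i, g in enumerate(groups)}
    e = np.zeros((len(groups), len(groups)))
    for i, j in zip(*np.nonzero(np.triu(adj, k=1))):
        e[idx[labels[i]], idx[labels[j]]] += 1
        e[idx[labels[j]], idx[labels[i]]] += 1
    e /= e.sum()
    return np.trace(e) - (e @ e).sum()

adj = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[i, j] = adj[j, i] = 1
print(modularity(adj, [0, 0, 0, 1, 1, 1]))   # ~0.357 for the two triangles
```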
After dividing the papers into optimized clusters using Newman’s method, the role of each paper is determined by its within-cluster degree and its participation coefficient, which define how the node is positioned in its own cluster and between clusters (Guimera and Amaral, 2005). ... Within-cluster degree z_i measures how “well connected” node i is to other nodes in the cluster... Participation coefficient P_i measures how “well distributed” the edges of node i are among different clusters. (A code sketch of z, P, and the role taxonomy follows the role list below.)
According to the within-cluster degree, they classified nodes with z>=2.5 as hub nodes and nodes with z<2.5 as non-hub nodes.
In addition, non-hub nodes can be naturally divided into four different roles:
(R1) ultra-peripheral nodes; that is, nodes with most of their edges within their cluster (P<0.05),
(R2) peripheral nodes; that is, nodes with many edges within their cluster (0.05 < P <= 0.62);
(R3) non-hub connector nodes, that is, nodes with a high proportion of edges to other clusters (0.62 < P <= 0.80);
and (R4) non-hub kinless nodes, that is, nodes with edges homogeneously distributed among all clusters (P>0.80).
Similarly, hub nodes can be classified into three different roles:
(R5) provincial hubs, that is, hub nodes with the vast majority of edges within their cluster (P<0.30);
(R6) connector hubs, that is, hubs with many edges to the other clusters (0.30<P<=0.75);
and (R7) kinless hubs, that is, hubs with edges homogeneously distributed among other clusters (P>0.75).
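As mentioned above, here is a sketch of z, P, and the R1-R7 taxonomy; it assumes numpy, an undirected adjacency matrix without isolated nodes, and cluster labels produced by any prior clustering step.

```python
import numpy as np

def within_cluster_degree(adj, labels):
    """z_i: z-score, within each cluster, of node i's internal link count."""
    labels = np.asarray(labels)
    kappa = np.array([adj[i, labels == labels[i]].sum()
                      for i in range(len(labels))], dtype=float)
    z = np.zeros_like(kappa)
    for c in set(labels.tolist()):
        m = labels == c
        sd = kappa[m].std()
        z[m] = (kappa[m] - kappa[m].mean()) / sd if sd > 0 else 0.0
    return z

def participation_coefficient(adj, labels):
    """P_i = 1 - sum_s (kappa_is / k_i)^2 over clusters s."""
    labels = np.asarray(labels)
    k = adj.sum(axis=1).astype(float)
    groups = sorted(set(labels.tolist()))
    k_is = np.array([[adj[i, labels == c].sum() for c in groups]
                     for i in range(len(labels))], dtype=float)
    return 1.0 - ((k_is / k[:, None]) ** 2).sum(axis=1)

def role(z, P):
    """Map (z, P) to the R1-R7 roles using the thresholds listed above."""
    if z < 2.5:
        if P < 0.05: return "R1 ultra-peripheral"
        if P <= 0.62: return "R2 peripheral"
        if P <= 0.80: return "R3 non-hub connector"
        return "R4 non-hub kinless"
    if P < 0.30: return "R5 provincial hub"
    if P <= 0.75: return "R6 connector hub"
    return "R7 kinless hub"
```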
In GaN, where incremental innovation occurred, the top 10 papers changed position from (R2) peripheral nodes to (R6) connector hubs as the domain developed.
However, in CN, where branching innovation occurred, the top 10 papers moved from (R1) ultra-peripheral nodes to (R5) provincial hubs.
For both R&D managers in companies or research institutions and policy makers, there are two approaches, expert-based and computer-based, to noticing emerging research domains among numerous academic papers. However, the former becomes highly laborious and difficult as each research domain grows specialized and segmented.
Our computer-based method, at least, complements this expert-based approach for the following three reasons.
First of all, experts’ judgment is not always right, especially in the current information-flood era. Sometimes, once-humble researchers accomplish great scientific achievements. Experts may fail to give credit to emerging trends.
Second, gathering experts is expensive. Identifying the quality of these papers before they become a new emerging cluster requires numerous experts.
Finally, our method is scalable. Even if the publication cycle becomes shorter and the number of publications grows, the computer-based approach could be effective.
Moreover, although previous research in knowledge mapping emphasized visualization as the means of detecting emergence, our method enables detection by monitoring variables such as z and P. With visualization, one must judge the emergence of a research cluster from the visualized map itself; quantitative variables such as z and P open a way to detect it in a machine-friendly manner.
In domains where incremental innovation occurs, hub papers are connector hubs with large z and large P. On the other hand, in the case of branching innovation, there is a new emerging cluster and active research centers shift rapidly and hub papers become provincial hubs with large z and small P.
This means that in the case of GaN, hub papers have intercluster edges, which connect some clusters; however, in the case of CN, hubs connect mainly in their own clusters and have few intercluster edges.
In the detection of emerging research domains, the shortcoming of this approach is the existence of time lag. It takes 1 or 2 years until a paper receives citations from other papers. It also takes 1 or 2 years from the completion of research to the publication of the research. Therefore, in the context of TIM and research policy, policy makers should complement this approach with not-published information such as academic conference and expert opinion.

Groh, G. and Fuchs, C. (2011). Multi-modal social networks for modeling scientific fields. Scientometrics, 89, 569-590.



network analysis

This study applies the concepts and methods of social network analysis to domain analysis, using the publications of a scholarly field to build a co-authorship network, a person-organization network, a co-citation network, a journal-person network, and a conference-person network. Taking mobile social networking as the domain of analysis, the authors collected 933 related publications, built the networks above, and analyzed them with centrality measures and node clustering. The co-authorship network has 1687 nodes and 2926 links and falls into 538 components; the largest component holds 200 nodes, about 11.86% of all nodes, with a diameter of 13, an average shortest path length of 5.21, a density of 0.0394, and a global clustering coefficient of 0.63. tf-idf was used to label the topics of the authors in each component; for the largest component the topics are application, mobility, social relations, and usefulness. The author co-citation network contains 1687 authors and 52928 co-citation pairs; its largest component holds 1490 authors (~88.3%) and 51171 links, with a diameter of 7, an average shortest path length of 2.8672, a density of 0.0475, and global and average local clustering coefficients of 0.7 and 0.86 respectively. To identify the main research topics of mobile social networking, the largest component was clustered with the modularity-based algorithm of Clauset, Newman and Moore (2004), reaching a modularity of 0.616, and the network was visualized with the pathfinder network algorithm (Schvaneveldt et al. 1989), with the seven subgroups obtained from the clustering marked on the map. The study suggests applying these analyses to three kinds of services: 1) overview services, which use the network maps to give a comprehensive overview of the whole field; 2) tracking services, which follow the changes and movements of a node or a group of nodes over a period of time; and 3) evolution services, which observe how the networks evolve over time.


When two authors publish their work together, both authors are treated as nodes and are connected by an edge (their publications) in the co-authorship graph. The edge can be weighted e.g. to reflect the number of papers published jointly or to illustrate temporal aspects. Co-authorship networks are—as typical representatives for social networks—scale-free and conform to the small world phenomenon (Barabasi et al. 2002; Porter et al. 2009).
The assumption for co-citation analysis is that two documents, authors, journals or other objects which get cited jointly by a (later) third document have—at least from the perspective of the citing author—some coincidence in terms of content. The more frequently two objects get cited together the more this similarity is stressed. This technique was presented for documents (document co-citation) by Small (1973) and for authors (author co-citation) by White and Griffith (1981) and is often used to create a semi-automatic overview of the literature of a scientific field.
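Both the co-authorship graph and the raw co-citation graph reduce to the same weighted co-occurrence construction; a minimal networkx sketch with hypothetical toy inputs:

```python
import networkx as nx
from itertools import combinations

def weighted_pair_graph(groupings):
    """Each grouping is a set of items occurring together (the authors of
    one paper, or the references cited by one paper); edge weights count
    how often a pair co-occurs."""
    G = nx.Graph()
    for group in groupings:
        for a, b in combinations(sorted(set(group)), 2):
            w = G.edges[a, b]["weight"] + 1 if G.has_edge(a, b) else 1
            G.add_edge(a, b, weight=w)
    return G

coauthorship = weighted_pair_graph([["Groh", "Fuchs"], ["Groh", "Lehmann"]])
cocitation = weighted_pair_graph([["Small", "White"], ["Small", "White"]])
```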
By now, many author co-citation studies have been performed for many different fields (e.g. Chen and Carr 1999; McCain et al. 1990; Tsay et al. 2003; White and McCain 1998; Zhao and Strotmann 2008). Current approaches use Pathfinder Networks (Buzydlowski 2003; Chen and Morris 2003; Chen and Hsieh 2007; Lin et al. 2003; McCain et al. 1990; White 2003b) or self-organized maps (Buzydlowski 2003; Lin et al. 2003).
In order to more precisely define ‘to cover’ one basically has to answer three questions/define three sub-concepts of ‘to cover’:
– Define criteria for deciding whether an article belongs to the domain in question
– Define criteria when the data-set is sufficiently large to map the relevant structures in the corresponding networks.
– Decide upon the set of meta-data recorded for the construction of the networks
The resulting data set which was collected in July and August 2009 consists of 933 articles and their associated items/objects from the scientific domain ’Mobile Social Networking’, forming a multi-modal network.
The data-model of items considered encompasses persons (authors and researchers), documents (articles), journals (and journal issues), conferences (and conference instances) and projects (which is an abstraction of scientific projects, working groups and other target-oriented organizations of persons) and free optional tags for every item.
The co-authorship network derived from the collected data consists of 1687 nodes and 2926 edges. The graph resolves into 538 components with the biggest component containing 200 nodes (approx. 11.86%).
The Person-Organization Network is the graph which results from looking at the relations between authors and their affiliations (companies, universities, research centers, etc.) retrieved from the articles in the database. The bipartite graph consists of 393 components (1194 author nodes and 544 nodes standing for organizations). The biggest component contains 330 nodes (~15%) and is composed of 277 authors and 53 organizations.
The average degree of an organization node is 6.62 (median 4.0) and the standard deviation is 7.76. For the nodes representing a person the average degree is 1.27 (median 1.0) with a standard deviation of 0.56. This implies that a big part of the authors in the data set maintain relations to only one organization and that many authors concentrate on a few organizations (17 organizations only have one assigned author each, whereas just two organizations— Carnegie Mellon University and MIT Media Lab—are connected to more than 30 authors).
An author co-citation analysis can be done at least in two ways:
The traditional way ensuing McCain’s paper (McCain 1990) uses a vector model to compare the co-citation profiles of authors. For each author, a vector is calculated which contains the author cocitation count for each other author of the data set. Afterwards, the analysis compares the author vectors using a measure like cosine (van Eck and Waltman 2008; Egghe and Leydesdorff 2009) or Pearson correlation (McCain 1990). A link between two authors does not necessarily mean that both got co-cited, it just tells something about the similarity of their co-citation vectors (i.e. how they get co-cited with all authors).
The other approach used in studies like White (2003b) works on the raw data: a link between two authors expresses that these two authors got co-cited (and does not reveal something about their relationship to any other author).

Since this study focuses on social networks, the second approach is discussed in detail.
The data set contains 1687 authors who make up 52928 co-citation pairs. 1490 authors (~88.3%) form the biggest component with 51171 edges. The remaining 197 authors are distributed among 148 additional components which are noticeably smaller. The diameter of the biggest component is 7, the density is 0.0475, the average path length is 2.8672, the global clustering coefficient is 0.7 and the average local clustering coefficient is 0.86.
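The component statistics quoted throughout this section (diameter, average path length, density, clustering coefficients) map one-to-one onto networkx calls; a sketch on a stand-in graph:

```python
import networkx as nx

G = nx.barabasi_albert_graph(200, 2, seed=0)   # stand-in for the ACA graph
H = G.subgraph(max(nx.connected_components(G), key=len))

stats = {
    "diameter": nx.diameter(H),
    "avg_path_length": nx.average_shortest_path_length(H),
    "density": nx.density(H),
    "global_clustering": nx.transitivity(H),
    "avg_local_clustering": nx.average_clustering(H),
}
```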
The next question is whether it is possible to split the big component into several clusters with different research topics within the mobile social networking community. Thus the big component was clustered using the modularity-based clustering method of Clauset, Newman and Moore (Clauset et al. 2004). The modularity reached, 0.616, is relatively high.
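networkx ships the Clauset-Newman-Moore greedy modularity algorithm directly; a toy sketch (the karate club graph stands in for the largest component):

```python
import networkx as nx

G = nx.karate_club_graph()    # stand-in for the largest ACA component
communities = nx.community.greedy_modularity_communities(G)
Q = nx.community.modularity(G, communities)
print(len(communities), round(Q, 3))
```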


A Pathfinder Network (Schvaneveldt et al. 1989) of the largest component can be seen in Fig. 2.

Cluster #1 contains documents which deal with different use cases for context sensitive applications. The example applications cover e.g. tourist information systems and the fast development of prototypes for mobile, context sensitive applications.
In Cluster #2 the theory of scale-free networks is the main topic; important authors are Réka Albert, Albert-László Barabási, Duncan J. Watts and Mark Granovetter.
Cluster #3 highlights security and data privacy; the main authors are Marco Gruteser, Jason I. Hong, Paul Dourish and Anind K. Dey.
The authors in cluster #4 write about ubiquitous computing, mobile applications with social, local and contextual reference.
In cluster #5 the influence of new forms of communication and of general technological development on our society is examined.
The topics in cluster #6 are sensor networks based on mobile phones.
Compared with the other clusters, cluster #7 is rather hardware-centric and deals with delay-tolerant networks.
This part of the article discusses the derived journal-person network. The bipartite graph contains 151 components with more than one node and consists of 861 authors and 242 journals. In this part, the largest component with 443 authors and 69 journals is discussed.
The nodes represent journals and two journals are connected by an edge when at least one person exists who published in both journals. The weight of an edge correlates to the number of authors who published in both of the connected journals. Thus, the decision on how similar two journals are is left to the authors, the users of the journals, themselves. ... The density of the network is 0.084, it consists of 69 nodes (journals) and 196 edges.
The Pathfinder Network shown in Fig. 3 reaches from the area of human computer interaction (upper left) to the social sciences (upper right):
The network consisting of conferences and persons contains 121 components with more than one node. A component has 7.66 nodes on average (with a standard deviation of 17.29). The largest component includes 185 nodes (see Fig. 4).
– overview services: the primary goal is to get a comprehensive overview of the modeled scientific area
Especially newcomers in a scientific field can have problems acquiring a broad overview of the area. The approach presented here can help to analyze the scientific area more in detail.
The author co-citation network can be used to locate authors with respect to their research interests. The representation as a Pathfinder Network is, besides the application of a clustering mechanism, useful for getting easily interpretable results. Author co-citation qualifies because the decision concerning the similarity of authors is made by the citing authors themselves. A disadvantage is that the picture drawn by author co-citation analysis shows only the past, not the present, since a newly published paper is not cited immediately.
The collaboration of authors can be visualized using the co-authorship network. This graph can be used to identify definable schools within the scientific field (to get results which can be interpreted easily a clustering mechanism and a Pathfinder Network rendering can help).
A concrete service could allow the user to identify different sub-areas within the scientific field. ... The user has the option to focus his research on interesting sections only (ignoring the other subareas) or to do a more macro-orientated analysis and work on the whole graph.
The analysis of graphs with journals or conferences offers an insight in the communication patterns within the scene: from the perspective of the newcomer, this view on the graph can reveal interesting literature, the more experienced researcher can use this view to identify the best fitting journals for his/her articles.
The network of organizations can help active researchers to identify interesting organizations as career options: if an author specialized in a specific topic it would be favorable to work at an organization which already has influence in the specific area.
– tracking services: one or more nodes of the graph are separately tracked together with a history of their dynamics over a specific time-period. This type of service cannot unleash its full potential until the model can be updated automatically, because frequent manual model updates are cumbersome and resource-consuming for the user.
If the underlying network can be updated automatically, tracking services can be implemented effectively. These services observe a set of nodes and document their development over time.
– evolution services: this kind of service tries to explain the development in the model from a given point in time up to the current situation.
... for this type of services the network needs to be updated on a regular basis (manually or, preferably, in an automatic way). Such a service can be useful for people who have been working in the observed field for a certain time but had to interrupt their work for a longer period of time (e.g. for other projects, parental leave, sabbatical, etc.). ... A representation with a graph visualization tool like SoNIA (McFarland and Bender-deMoll 2009) might be helpful too, in order to get a high level overview of the changes in the area. SoNIA displays changes in a graph as an animation which documents the dynamics of the network on a step by step basis.