Quirin, A., Cordón, O., Vargas-Quesada, B., & Moya-Anegón, F. (2010). Graph-based data mining: A new tool for the analysis and comparison of scientific domains represented as scientograms. Journal of Informetrics, 4(3), 291-312.
network analysis
This study applies graph-based data mining to discover the substructures shared by several scientograms, in order to investigate three questions: 1) the year-by-year evolution of a country's scientific domain, 2) the extraction of the research categories substructures common to the countries of the world, and 3) the comparison of the scientific domains of different countries.

For each of the 73 countries analyzed, and for each year, the study first performs a co-citation analysis of research categories and builds a network, the scientogram, in which each research category is a node and the links express the degree of co-citation relatedness between categories. The weight of each link is computed as CM(ij) = Cc(ij) + Cc(ij)/sqrt(c(i)*c(j)), where c(i) and c(j) are the citation counts of categories i and j, and Cc(ij) is their co-citation count. Since Cc(ij)/sqrt(c(i)*c(j)) lies between 0 and 1, when two pairs of categories have unequal co-citation counts this weighting assigns the larger value to the pair with the larger co-citation count; when the co-citation counts are equal, the pair whose citation counts are closer to the co-citation count receives the larger value.

The network is then pruned for dimensionality reduction with the Pathfinder networks algorithm, whose parameters r and q are set to r = ∞ and q = n − 1. Under this setting, a link between two nodes is considered non-significant and is removed from the network whenever there exists, apart from the link itself, a path between the two nodes in which every link carries a weight greater than that of the direct link. The pruned network can then be rendered as a visual map with a layout algorithm such as Kamada–Kawai, which positions the nodes and the link relations between them.

To extract the substructures common to the networks, the study adopts the Subdue algorithm (Cook & Holder, 1994, 2000), whose operating principle is the minimum description length (MDL) (Rissanen, 1989). The nodes, the links, and the connections between nodes of a network are represented as bit strings, and the total length of these bit strings is the network's description length. Let G be the set of networks, with total description length I(G); let S be a substructure occurring in several of the networks, with total description length I(S); and let I(G|S) be the total description length of G after every occurrence of S has been replaced by a single node. The MDL principle is to find the substructures that minimize I(S) + I(G|S); in practice this is done by maximizing valueMDLi(S,G) = I(G) / (I(S) + I(G|S)).

When evaluating substructures, besides the minimum description length (that is, the maximum valueMDLi), substructures of larger size and larger support are also regarded as more representative. Size is counted as the total number of nodes and links contained in a network or substructure, and is evaluated as valuesize(S,G) = Size(G) / (Size(S) + Size(G|S)). Support counts the number of networks that contain the substructure, and is evaluated as valuesupport(S,G) = #graphs in G including S / card(G), where card(G) is the number of networks. If negative graphs are considered, i.e., the substructure should appear in as few negative graphs as possible, the description length evaluation becomes valueMDLi(S,Gp,Gn) = (I(Gp) + I(Gn)) / (I(S) + I(Gp|S) + I(Gn) − I(Gn|S)), where Gp and Gn denote the positive and the negative graphs respectively; the size evaluation becomes valuesize(S,Gp,Gn) = (Size(Gp) + Size(Gn)) / (Size(S) + Size(Gp|S) + Size(Gn) − Size(Gn|S)); and the support evaluation becomes valuesupport(S,Gp,Gn) = (#Gp graphs including S + #Gn graphs not including S) / (card(Gp) + card(Gn)).

The experimental setup and results for each of the three questions are described below:
1) Year-by-year evolution of a country's scientific domain

Taking the year as the unit of analysis, the networks produced in the year under analysis are treated as positive graphs and those of the preceding years as negative graphs, and the substructures significantly contained in the positive graphs are discovered.

2) Extraction of the research categories substructures common to the countries of the world

Taking the country as the unit of analysis, the networks produced by all countries are treated as positive graphs, and the substructures significantly contained in them are discovered.

3) Comparison of the scientific domains of different countries

Taking the country as the unit of analysis, the networks of the country under analysis are treated as positive graphs and those of the remaining countries as negative graphs, and the substructures significantly contained in the positive graphs are discovered. (How the three setups translate into positive and negative graph sets is sketched in the code below.)
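All names in the following Python sketch are this note's assumptions, not code from the paper: a scientograms dictionary keyed by (country, year) pairs, and three illustrative helper functions mapping each task onto Subdue's positive/negative operation modes.

```python
def split_for_evolution(scientograms, country, neg_years, pos_years):
    """Task 1: one country's earlier years are negative, later years positive."""
    pos = [g for (c, y), g in scientograms.items()
           if c == country and y in pos_years]
    neg = [g for (c, y), g in scientograms.items()
           if c == country and y in neg_years]
    return pos, neg

def split_for_world_commons(scientograms, year):
    """Task 2: every country's map for one year is positive; no negatives."""
    pos = [g for (c, y), g in scientograms.items() if y == year]
    return pos, []

def split_for_country_comparison(scientograms, country, year):
    """Task 3: one country is positive, all remaining countries negative."""
    pos = [g for (c, y), g in scientograms.items()
           if y == year and c == country]
    neg = [g for (c, y), g in scientograms.items()
           if y == year and c != country]
    return pos, neg
```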
In this paper, we aim to show that graph-based data mining tools are useful to deal with scientogram analysis. Subdue, the first algorithm proposed in the graph mining area, has been chosen for this purpose. This algorithm has been customized to deal with three different scientogram analysis tasks regarding the evolution of a scientific domain over time, the extraction of the common research categories substructures in the world, and the comparison of scientific domains between different countries.
The visualization of scientific information has long been used to uncover and divulge the essence and structure of science (Börner & Scharnhorst, 2009; Chen, 1999a, 2004).
Yet despite its ripe age, information display is still in an adolescent stage of evolution in the context of its application to scientific domain analysis.
There is a large number of information visualization techniques which have been developed over the last decade within this area (Chen, 1999b; Lucio-Arias & Leydesdorff, 2008; Moya-Anegón et al., 2007, 2005; Small & Garfield, 1985), but none of them has been designed to support the exploration of large datasets. Besides, all the latter approaches require a large amount of expertise from the user, which reduces the chances to automate the analysis procedure. Nevertheless, it is clear that information visualization and visual data mining (Keim, 2002) can provide the theoretical and practical backgrounds to deal with scientific information analysis.
The generation of a big picture is something implicit in the process of visualizing scientific information. In an attempt to sum up what has taken place to date, we can say that nowadays there are two proposals for tracking down the big picture. On the one hand, one can adopt the traditional units of analysis (authors, documents, and journals) and, through their grouping, identify scientific disciplines following a bottom-up process (Boyack & Klavans, 2008; Klavans & Boyack, 2006; Small & Sweeney, 1985; Small, Sweeney, & Greenlee, 1985). On the other hand, the alternative uses the categories of the documents to the same end, and shows the scientific structure from them in a top-down manner (Moya-Anegón et al., 2004).
The former proposal (bottom-up process) presents all the pros of its fine-grained character, but it runs into difficulties in representing the totality of the panorama on a single plane and in tagging the disciplines.
With the latter proposal (top-down process), it is relatively simple to represent the scientific structure of a domain on a single plane by means of a maximum of 300 categories and their interrelations, avoiding tagging problems. However, this implies the acceptance of a classification of science into predefined categories, never transparent and always subjective, as well as the fact that documents are classified by the journals in which they are published and not by their content (coarse-grained character).
Current scientogram analysis techniques (Boyack, Börner, & Klavans, 2009; Chen et al., 2009; Klavans & Boyack, 2006; Leydesdorff & Rafols, 2009; Moya-Anegón et al., 2007) aim to provide a fine, detailed, tight view of a scientogram. To do so, they are based on performing a low-level analysis and comparison of the maps. Statistical techniques, computer algorithms, and macrostructure and microstructure techniques for the identification of thematic areas and scientific disciplines have already been used to analyze and compare scientograms (Boyack, Klavans, & Börner, 2005; Chen, 1999b; Lucio-Arias & Leydesdorff, 2008; Moya-Anegón et al., 2007; Wallace, Gingras, & Duhon, 2009).
However, this approach shows a main limitation: only a single or a very reduced set of maps can be analyzed or compared together. In fact, the field lacks an easy-to-use approach allowing the identification and the comparison of scientific structures within scientograms with a higher degree of automation.
Graph-based data mining (GBDM) (Cook & Holder, 2006; Holder & Cook, 2005; Washio & Motoda, 2003) involves the automatic extraction of novel and useful knowledge from a graph representation of data. By ‘novel’ we mean that the knowledge retrieved is not directly encoded in the data but deeply masked in it (hence, it needs to be uncovered), and by ‘useful’ we mean that the discovered patterns are in general of interest to the domain expert ...
In fact, GBDM techniques have been applied for frequent substructure discovery and graph matching in a large number of domains including chemistry and applied biology (Borgelt & Berthold, 2002; Huan et al., 2004), classification of chemical compounds (Deshpande, Kuramochi, & Karypis, 2002), and unsupervised and supervised pattern learning (Cook & Holder, 2006), among many others.
Subdue (Cook & Holder, 1994, 2000) is a graph-based knowledge discovery system that finds structural, relational patterns in data representing entities and relationships. It aims to discover interesting and repetitive substructures in a structural database (DB). For this purpose, the minimum description length (MDL) principle (Rissanen, 1989) is used in order to compress the original DB into a hierarchical and shorter version.
In particular, we will describe how this algorithm can be customized to deal with three different scientogram analysis and comparison tasks regarding the evolution of a scientific domain over time, the extraction of the common research categories substructures in the world, and the comparison of scientific domains between different countries.
The generation of a scientogram following the top-down approach (Moya-Anegón et al., 2004) requires the sequential application of several techniques.
1. Units of analysis: The categories are the units of analysis and representation (Moya-Anegón et al., 2004; Vargas-Quesada & Moya-Anegón, 2007).
2. Unit of measure: ... a co-citation measure CM is computed for each pair of categories i and j as follows:
CM(ij) = Cc(ij) + Cc(ij)/sqrt(c(i)*c(j))
where Cc is the co-citation frequency and c is the citation frequency.
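As a quick illustration of how this measure behaves, here is a small Python sketch; the function name and the example counts are this note's own, not the authors' code.

```python
# Co-citation measure CM(ij) = Cc(ij) + Cc(ij)/sqrt(c(i)*c(j)).
# The normalized term lies in [0, 1], so it only breaks ties between
# pairs with equal raw co-citation counts.
from math import sqrt

def co_citation_measure(cc_ij: int, c_i: int, c_j: int) -> float:
    if c_i == 0 or c_j == 0:
        return 0.0
    return cc_ij + cc_ij / sqrt(c_i * c_j)

# Two pairs with the same co-citation count (10): the pair whose citation
# counts stay close to the co-citation count scores higher.
print(co_citation_measure(10, 12, 15))   # ~10.75
print(co_citation_measure(10, 100, 90))  # ~10.11
```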
3. Dimensionality reduction: ... the Pathfinder algorithm (Chen, 1998; Dearholt & Schvaneveldt, 1990) is applied to the co-citation matrix to prune the network. Due to the density of the data, and especially in the case of vast scientific domains with a high number of entities (categories in our case) in the network, Pathfinder is usually parameterized to r = ∞ and q = n − 1. This is done in order to preserve and highlight the salient relationships between categories and to capture the essential underlying intellectual structure of a scientific domain.
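A minimal sketch of this pruning rule for similarity weights such as CM, under the r = ∞, q = n − 1 setting: a link survives unless some alternative path connects its endpoints using only strictly heavier links. This brute-force O(E²) version is for illustration only and is not the paper's implementation.

```python
from collections import defaultdict, deque

def pathfinder_prune(edges):
    """edges: list of (u, v, weight); returns the links that survive."""
    kept = []
    for u, v, w in edges:
        # Adjacency restricted to links strictly heavier than (u, v);
        # the link itself is excluded since its own weight is not > w.
        adj = defaultdict(list)
        for a, b, w2 in edges:
            if w2 > w:
                adj[a].append(b)
                adj[b].append(a)
        # BFS: is v reachable from u through heavier links only?
        seen, queue = {u}, deque([u])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        if v not in seen:          # no dominating path: keep the link
            kept.append((u, v, w))
    return kept

# A triangle whose weakest side is bypassed by the two stronger ones:
print(pathfinder_prune([("A", "B", 5), ("B", "C", 4), ("A", "C", 1)]))
# -> [('A', 'B', 5), ('B', 'C', 4)]
```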
4. Layout: The spring embedder family of methods is the most widely used in the area of Information Science. Spring embedders assign coordinates to the nodes in such a way that the final graph will be pleasing to the eye, and that the most important elements are located in the center of the representation (also called its backbone). Kamada–Kawai’s algorithm (Kamada & Kawai, 1989) is one of the most widespread methods for performing this task. Starting from a circular position of the nodes, it generates networks following aesthetic criteria such as the maximum use of available space, the minimum number of crossed links, the forced separation of nodes, the building of balanced maps, etc.
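For a quick reproduction of this step, the networkx library ships a Kamada–Kawai implementation; the snippet below is this note's assumption (the authors used their own visualization pipeline, not necessarily networkx).

```python
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
G.add_weighted_edges_from([("A", "B", 5), ("B", "C", 4)])  # a pruned network
pos = nx.kamada_kawai_layout(G)   # spring-embedder style coordinates
nx.draw(G, pos, with_labels=True)
plt.show()
```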
The need for mining structural data to uncover objects or concepts that relate objects (i.e., subgraphs that represent associations of features) has increased in the past ten years, thus creating the area of GBDM (Holder & Cook, 2005; Washio & Motoda, 2003). Nowadays, GBDM has become a very active area and several techniques such as Subdue, the Apriori family of methods (Apriori-based GM (Inokuchi, Washio, & Motoda, 2000), Frequent Subgraph Discovery (Kuramochi & Karypis, 2001), JoinPath (Vanetik, Gudes, & Shimony, 2002), etc.), and the Frequent Pattern-growth family of methods (CloseGraph (Yan & Han, 2003), FFSM (Huan, Wang, & Prins, 2003), Gaston (Nijssen & Kok, 2004), gSpan (Yan & Han, 2002), MoFa/MoSS (Borgelt & Berthold, 2002), Spin (Huan, Wang, Prins, & Yang, 2004), etc.) have been proposed to deal with problems such as graph matching, graph visualization, frequent substructure discovery, conceptual clustering, and unsupervised and supervised pattern learning (Cook & Holder, 2006).
Among them, we can highlight Subdue (Cook & Holder, 1994, 2000), a graph-based knowledge discovery system that finds structural, relational patterns in data representing entities and relationships. ... It is able to perform graph shrinking as well as frequent substructure extraction and hierarchical conceptual clustering.
Subdue (Cook & Holder, 1994, 2000) is a method for discovering interesting and repetitive substructures in a structural DB. The algorithm uses the MDL principle (Rissanen, 1989) to discover frequent substructures in a DB, extract them and replace them by a single node in order to compress the DB. These extracted substructures represent structural concepts in the data.
The Subdue algorithm can be run several times in a sequence in order to extract meta-concepts from the previously simplified DB. After multiple Subdue runs on the DB, we can discover a hierarchical description of the structural regularities in the data (Jonyer, Cook, & Holder, 2001).
Subdue can also use background knowledge, such as domain-oriented expert knowledge, to be guided and to discover substructures for a particular domain goal.
Subdue uses a variant of beam search (Lowerre, 1976) in order to avoid an exponential-sized queue: at each step, only BeamWidth new children of a given parent are explored. Furthermore, at most MaxBest substructures with a maximal size of MaxSubSize are returned to the user, and the algorithm does not run more than Limit iterations. These parameters ensure that the running time of Subdue is polynomial, constrained in practice by the BeamWidth and Limit parameters (Jonyer et al., 2001).
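The control flow just described can be rendered as a bounded beam search. The sketch below is schematic, not Subdue itself: expand and evaluate are placeholders (in Subdue, expansion grows a candidate substructure by one vertex or edge, and evaluation is one of the three measures described next), and candidates are assumed to support len() as their size.

```python
def subdue_search(seeds, expand, evaluate,
                  beam_width=4, max_best=3, max_sub_size=10, limit=100):
    # Candidate queue, sorted by score and truncated to BeamWidth.
    queue = sorted(seeds, key=evaluate, reverse=True)[:beam_width]
    best = list(queue)
    for _ in range(limit):                       # at most Limit iterations
        if not queue:
            break
        parent = queue.pop(0)                    # best candidate so far
        # Only BeamWidth new children of a given parent are explored,
        # and children larger than MaxSubSize are discarded.
        children = [c for c in expand(parent) if len(c) <= max_sub_size]
        children = sorted(children, key=evaluate, reverse=True)[:beam_width]
        queue = sorted(queue + children, key=evaluate, reverse=True)[:beam_width]
        # Keep at most MaxBest substructures to report to the user.
        best = sorted(set(best + children), key=evaluate, reverse=True)[:max_best]
    return best

# Toy run: candidates are tuples, expansion appends a symbol, and the
# score is simply the tuple length (a stand-in for MDLi / Size / Support).
print(subdue_search([(0,)],
                    expand=lambda s: [s + (x,) for x in range(2)],
                    evaluate=len))
```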
The evaluation of a substructure can be computed by the MDL measure, the Size measure, or the Support measure, each described below.
The MDL of a graph is the number of bits necessary to describe the graph completely. For a substructure S, this is the value I(S), computed as the sum of the number of bits needed to encode the vertices of S, the number of bits needed to encode the edges of S, and the number of bits needed to encode the adjacency matrix describing the connectivity of S. Subdue looks for the substructure S minimizing I(S) + I(G|S), where G is the input graph, I(S) is the number of bits required to encode the uncovered substructure, and I(G|S) is the number of bits required to encode the graph obtained by compressing G with S, i.e., by substituting each occurrence of S in G with a single node (Holder, Cook, & Djoko, 1994).
In the following, we rename this measure the MDLi measure (’i’ stands for inverse), as we are maximizing its value: Subdue considers a given substructure S better than another substructure S′ if the MDLi measure valueMDLi(S,G) is higher than valueMDLi(S′,G), where valueMDLi(S,G) is computed as follows:
valueMDLi(S,G) = I(G) / (I(S) + I(G|S))
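A toy rendering of this computation in Python. The bit estimator below is a crude stand-in assumed by this note; the actual Subdue encoding of vertices, edges, and the adjacency matrix (Holder et al., 1994) is considerably more involved.

```python
from math import log2

def description_length(num_vertices: int, num_edges: int) -> float:
    """Toy stand-in for I(G): label bits for the vertices plus endpoint
    bits for the edges (not the real Subdue encoding)."""
    if num_vertices == 0:
        return 0.0
    bits_per_id = log2(num_vertices + 1)
    return num_vertices * bits_per_id + num_edges * 2 * bits_per_id

def value_mdl_i(i_g: float, i_s: float, i_g_given_s: float) -> float:
    """valueMDLi(S, G) = I(G) / (I(S) + I(G|S)); higher is better."""
    return i_g / (i_s + i_g_given_s)

# Example: a 20-node, 19-edge graph compressed by a 4-node, 3-edge
# substructure occurring 3 times; each occurrence collapses to one node.
i_g = description_length(20, 19)
i_s = description_length(4, 3)
i_g_s = description_length(20 - 3 * (4 - 1), 19 - 3 * 3)
print(value_mdl_i(i_g, i_s, i_g_s))  # > 1 means the compression pays off
```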
However, the alternative operation mode for Subdue considers two distinct sets, a positive set Gp and a negative set Gn, determined by the user. In this operation mode, the goal of Subdue is to find the largest substructures present in the maximum number of graphs in the positive set, which are not included in the negative set. The MDLi measure is thus computed as follows:
valueMDLi(S,Gp,Gn) = (I(Gp) + I(Gn)) / (I(S) + I(Gp|S) + I(Gn) − I(Gn|S))
The size of an object is not computed from the description length, but from an index based on either the number of nodes, the number of edges, or, more usually, the sum of both values. This measure is faster to compute but less consistent, as it does not reflect the real benefit obtained after the compression of the DB.
valuesize(S,G) = Size(G) / (Size(S) + Size(G|S))
where, usually, Size(G) = #vertices(G) + #edges(G).
In the case of the second operation mode, in which we have a positive and a negative scientogram set, the Size measure is computed as follows:
valuesize(S,Gp,Gn) = (Size(Gp) + Size(Gn)) / (Size(S) + Size(Gp|S) + Size(Gn) − Size(Gn|S))
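Both variants of the Size measure transcribe directly into code; in this sketch a graph is reduced to a (vertex count, edge count) pair, which is all the measure needs.

```python
def size(g):
    """Size(G) = #vertices(G) + #edges(G); g is a (vertices, edges) pair."""
    return g[0] + g[1]

def value_size(s, g, g_given_s):
    """valuesize(S, G) = Size(G) / (Size(S) + Size(G|S))."""
    return size(g) / (size(s) + size(g_given_s))

def value_size_pn(s, gp, gp_given_s, gn, gn_given_s):
    """Two-set variant with a positive set Gp and a negative set Gn."""
    return (size(gp) + size(gn)) / (
        size(s) + size(gp_given_s) + size(gn) - size(gn_given_s))

# A 4+3 substructure compressing a 39-element graph down to 20 elements:
print(value_size((4, 3), (20, 19), (11, 9)))  # ~1.44
```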
The last alternative measure is based on the support of substructure S and it is expressed as follows:
valuesupport(S,G) = #graphs in G including S / card(G)
with card(G) being the cardinality of the set of graphs G composing the DB.
For the second operation mode, this evaluation measure is computed as the sum of the number of positive maps containing S and the number of negative maps not containing S, divided by the total number of maps. Its formulation is as follows:
valuesupport (S,Gp,Gn) = (#Gp graphs including S + #Gn graphs not including S) / (card(Gp) + card(Gn))
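The Support measure likewise admits a compact sketch. Here a graph is naively represented as a set of edges, so that “G includes S” becomes a subset test; Subdue itself relies on (inexact) subgraph matching instead.

```python
def value_support(s, graphs):
    """Fraction of graphs in the DB that include the substructure s."""
    return sum(1 for g in graphs if s <= g) / len(graphs)

def value_support_pn(s, positives, negatives):
    """Positive maps containing s plus negative maps not containing s,
    divided by the total number of maps."""
    hits = sum(1 for g in positives if s <= g) \
         + sum(1 for g in negatives if not s <= g)
    return hits / (len(positives) + len(negatives))

# s occurs in both positive maps and in neither negative map:
s = {("A", "B"), ("B", "C")}
pos = [{("A", "B"), ("B", "C"), ("C", "D")}, {("A", "B"), ("B", "C")}]
neg = [{("A", "D")}]
print(value_support(s, pos))          # 1.0
print(value_support_pn(s, pos, neg))  # 1.0
```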
We consider a substructure having a larger positive support and a smaller negative support as having a better quality. In the same way, substructures having a larger size are preferred over smaller ones as they are more specific.
Study of the evolution of the scientific domain of a specific country over time
An information science expert would be interested in knowing which substructures appear in the analyzed domain, at which time, how big they are, how many they are, where they are located, and so forth. This will allow them to perform at least two kinds of studies. On the one hand, an in-depth analysis of the uncovered substructures themselves: which kinds of categories they link, etc. On the other hand, global statistics about the size and the quantity of these substructures, to characterize respectively the importance of the evolution of the domain and its dynamics.
Thus, the goal of the first analysis task is to present a framework for the study of the evolution of a scientific domain over time using Subdue. ... As we want to look for common research categories substructures (CRCSs) which appeared at a given time, we also need to pick two ranges of years, the negative range and the positive range. The negative range is usually a set of years from the past, in which these substructures (i.e., CRCSs) are not meant to exist. The positive range is usually a set of years dated after the negative range, in which the substructures are meant to be present.
Identification of the common research categories substructures in the world
The aim of the second scientogram analysis task is to uncover the CRCSs in the world by analyzing the scientograms of a large number of different countries. ... All the selected maps representing the scientific production of those countries for that given year will be viewed as positive examples, so the goal of Subdue will be to extract the substructures with the best support among all of them. Notice that no negative examples are considered in this case. As the user will be especially interested in the extracted CRCSs being as specific as possible, the MDLi measure will again be considered in order to extract both frequent and large substructures.
Comparison of the scientific domains of different countries
To do so, the scientogram of each country in a given set is compared against the remaining ones in that set, with the current country's map viewed as a positive example and the others as negative maps. ... Note that this experiment could also be run using time periods longer than a single year, or with more than one country in the positive set each time, thus allowing an expert to extract the substructures highlighting the possible similarities between these countries.