information visualization
The general science mapping process [7] comprises the following five steps: 1) select an appropriate data source; 2) select a unit of analysis and extract the required data from the chosen source; 3) choose an appropriate similarity measure and calculate similarity values; 4) generate the data map using ordination and clustering algorithms; and 5) explore the resulting map to answer the research questions. Most previous journal-mapping studies used the Pearson correlation coefficient on journal co-citation data as the measure of inter-journal similarity and produced maps by ordination with multidimensional scaling (MDS) [11-16]. Leydesdorff [17,18] also used MDS as the mapping method, but measured inter-journal similarity with inter-citation data. Leydesdorff further applied the Pearson correlation coefficient on inter-citation data as the similarity measure and used the Pajek program [27] to map the SCI and SSCI journals separately [22,23]. In addition, Campanario [19] used a self-organizing map for ordination, while Tijssen and van Leeuwen [20] used journal content mapping, which allowed their study to include journals not covered by the ISI databases.
This study generates a map representing the structure of all of science from 7,121 journals in ISI's SCI and SSCI databases, and analyzes the structural accuracy of the resulting map while also examining its local accuracy. The latter means that journals belonging to the same subdiscipline should be grouped together (Klavans and Boyack, 2006); the former means that clusters of journals that cite one another should also be proximate to each other on the map, which is what this study regards as the backbone of science.
In this study, the VxOrd algorithm [32] was used for ordination of the data, k-means was used for clustering, and eight measures of inter-journal similarity were compared. Five are based on inter-citation data between journals: raw frequency, the Cosine index, the Jaccard index, the Pearson correlation coefficient, and the average relatedness factor proposed by Pudovkin and Garfield [25]. The other three are based on co-citation data: raw frequency, the Pearson correlation coefficient, and the K50 index proposed by the authors [1].
This study uses the journals' ISI subject categories to evaluate and compare the maps produced by the various similarity measures. For the evaluation of local accuracy, the study assumes that a pair of journals with high similarity should belong to the same subject category, and that journals in the same cluster on a map should share the same subject category. The results of [1] showed that ordination with the VxOrd algorithm actually increased local accuracy, and that the four normalized inter-citation measures performed comparably, with the Cosine index being the best. To compare the structural accuracy of the various maps, this study applies the mutual information test proposed by Gibbons and Roth [37], again using ISI's 205 subject categories as the reference classification. The test results show that, apart from the poor clustering obtained from raw co-citation data, the clusterings from the remaining similarity measures are roughly comparable; the Pearson correlation coefficient on inter-citation data yields the best result, although at 200 to 250 clusters the Jaccard index on inter-citation data performs comparably.
Finally, the five inter-citation-based measures and the three co-citation-based measures were each compared in terms of local accuracy, structural accuracy, scalability, and the readability of the clustering results. Among the three co-citation-based measures, the K50 index yields structural accuracy roughly equal to that of the Pearson correlation coefficient, but with better scalability and local accuracy, and produces a map that is more balanced in both cluster size and distribution. Among the five inter-citation-based measures, the Cosine index, the Jaccard index, and the Pearson correlation coefficient are superior to the other two in scalability and readability, and the map obtained with the Jaccard index has the highest structural accuracy.
This paper presents a new map representing the structure of all of science, based on journal articles, including both the natural and social sciences. ... Eight alternative measures of journal similarity were applied to a data set of 7,121 journals covering over 1 million documents in the combined Science Citation and Social Science Citation Indexes. For each journal similarity measure we generated two-dimensional spatial layouts using the force-directed graph layout tool, VxOrd.
By accuracy, we mean that journals within the same subdiscipline should be grouped together, and groups of journals that cite each other should be proximate to each other on the map. The first results from this effort, dealing with local accuracy, appeared recently. By contrast, this paper focuses on structural accuracy and characterization of the map defining the structure or backbone of science.
Published journal-based maps have typically been focused on single disciplines, and have used a Pearson correlation on co-citation counts with multidimensional scaling (MDS). [11-16] Other discipline-level studies not using the Pearson/MDS technique include the use of relative inter-citation counts with MDS by Leydesdorff [17,18], the use of a self-organizing map by Campanario [19], and the work by Tijssen and van Leeuwen to include non-ISI journals in their maps using journal content mapping. [20]
Leydesdorff has used the 2001 JCR data to map 5,748 journals from the Science Citation Index (SCI) [22] and 1,682 journals from the Social Science Citation Index (SSCI) [23] in two separate studies. In both studies Leydesdorff uses a Pearson correlation on citing counts as the edge weights and the Pajek program for graph layout, progressively lowering thresholds to find articulation points (i.e., single points of connection) between different network components. These network components are his journal clusters. The only potential drawback to this solution is that as thresholds are lowered, newly identified small components (presumably two or three journals each) are dropped from the solution space, so that the total number of journals comprising Leydesdorff's clusters is substantially less than the number in the original set.
An alternative to using journals to map the structure of science has recently been investigated by Moya-Anegón and associates [9] to good effect. Using 26,062 documents with a Spanish address from the year 2000 as a base set, they used co-cited ISI category assignments to create category maps. Their highest level map shows the relative positions, sizes and relationships between 25 broad categories of science in Spain.
The general process followed by most practitioners for creating knowledge domain maps has been explained in detail elsewhere. [7] This process can vary slightly depending upon the specific research question, but typically contains the following steps: 1) selection of an appropriate data source, 2) selection of a unit of analysis (e.g. paper, journal, etc.) and extraction of the necessary data from the selected source, 3) choice of an appropriate similarity measure and calculation of similarity values, 4) creation of a data layout using a clustering or ordination algorithm, and 5) exploration of the map based on the data layout as a means of answering the original research questions. Here, we add another step after 4) - statistical validation - that allows us to choose the similarity measure that produces the most accurate map.
Based on these considerations, we obtained the complete set of 1.058 million records from 7,349 separate journals from the combined SCI and SSCI files for the year 2000. Of the 7,349 journals, analysis was limited to the 7,121 journals that appeared as both citing and cited journals. ... Journal inter-citation frequencies were directly counted from the citing and cited journal information in these 16.24 million reference pairs. The resulting journal-journal inter-citation frequency matrix was extremely sparse (98.6% of the matrix has zeros). ... While there was a great deal more cocitation frequency information, the journal-journal co-citation frequency matrix was also sparse (93.6% of the matrix has zeros).
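As a sketch of how such a frequency matrix is assembled, the (citing, cited) reference pairs can be accumulated into a sparse journal-journal matrix; the pair data and journal count below are toy values, not the actual SCI/SSCI data:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy data: each row is one (citing_journal, cited_journal)
# reference pair, with journals identified by integer indices.
pairs = np.array([
    [0, 1], [0, 1], [0, 2],
    [1, 0], [2, 0], [2, 1],
])

n_journals = 3
# coo_matrix sums duplicate (row, col) entries on conversion, so repeated
# pairs accumulate into inter-citation frequencies.
freq = coo_matrix(
    (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
    shape=(n_journals, n_journals),
).tocsr()

# Fraction of zero cells, analogous to the 98.6% sparsity reported above.
sparsity = 1.0 - freq.nnz / (n_journals * n_journals)
```

At the scale of the real data (7,121 journals, 16.24 million pairs), keeping the matrix in a sparse format like this is what makes the similarity calculations tractable.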
For the purpose of map validation we also retrieved the ISI journal category assignments. For the combined SCI and SSCI, there were a total of 205 unique categories. Including multiple assignments, the 7,121 journals were assigned to a total of 11,308 categories, or an average of 1.59 categories per journal.
The five inter-citation measures include one unnormalized measure, raw frequency (IC-Raw), and four normalized measures: Cosine (IC-Cosine), Jaccard (IC-Jaccard), Pearson's r (IC-Pearson), and the recently introduced average relatedness factor of Pudovkin and Garfield [25] (IC-RFavg).
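The Cosine, Jaccard, and Pearson measures have standard textbook forms; a minimal sketch of those standard definitions on a symmetrized toy count matrix F follows (the paper's exact variants may differ in detail, and the matrix values are illustrative):

```python
import numpy as np

def intercitation_similarities(F):
    """Standard normalizations of a symmetrized inter-citation count matrix F.

    These are the textbook forms of the named measures; S_i is journal i's
    total citation count (row sum)."""
    F = np.asarray(F, dtype=float)
    S = F.sum(axis=1)                             # row sums per journal
    cosine = F / np.sqrt(np.outer(S, S))          # F_ij / sqrt(S_i * S_j)
    jaccard = F / (S[:, None] + S[None, :] - F)   # F_ij / (S_i + S_j - F_ij)
    pearson = np.corrcoef(F)                      # Pearson's r between citation profiles
    return cosine, jaccard, pearson

# Toy symmetric count matrix for three journals.
F = np.array([[0., 4., 1.],
              [4., 0., 2.],
              [1., 2., 0.]])
cos, jac, r = intercitation_similarities(F)
```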
The three co-citation measures include one unnormalized measure, raw frequency (CC-Raw); the vector-based Pearson's r (CC-Pearson); and a new normalized frequency measure [1] that we call K50 (CC-K50). This new measure, K50, is simply a cosine-type value minus an expected cosine value. E_ij is the expected value of F_ij and varies with the row sum S_j; thus K50 is asymmetric and E_ij ≠ E_ji. Subtraction of an expected value component tends to accentuate higher-than-expected relationships between two small journals or between a small and a large journal, and discounts lower-than-expected relationships between large journals. We thus expect the K50 measure to do a better job than other measures of accurately placing small journals, and to reduce the influence of large and multidisciplinary journals on the overall map structure.
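A rough sketch of a K50-style "observed minus expected" calculation is below. The expectation E_ij = S_i·S_j/T used here (T being the total count) is an illustrative assumption, and it yields a symmetric matrix; the published K50 in Klavans and Boyack [1] makes E_ij depend asymmetrically on the row sum, so consult that definition for the exact form:

```python
import numpy as np

def k50_sketch(F):
    """Sketch of a K50-style measure: a cosine-type value minus an expected one.

    ASSUMPTION: the simple independence expectation E_ij = S_i * S_j / T is
    used here for illustration; the published K50 formula differs (it is
    asymmetric in i and j)."""
    F = np.asarray(F, dtype=float)
    S = F.sum(axis=1)                  # row sums
    T = S.sum()                        # total counts
    E = np.outer(S, S) / T             # expected co-citation counts
    return (F - E) / np.sqrt(np.outer(S, S))  # observed-minus-expected, cosine-scaled

F = np.array([[0., 4., 1.],
              [4., 0., 2.],
              [1., 2., 0.]])
K = k50_sketch(F)
```

Even in this simplified form, pairs that co-occur more often than their sizes predict get positive scores, which is the accentuation effect described above.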
The most commonly used reduction algorithm is multidimensional scaling; however, its use has typically been limited to data sets on the order of tens or hundreds of items.
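At the scales where MDS is feasible, a precomputed dissimilarity matrix can be embedded in two dimensions with an off-the-shelf implementation; the 4×4 matrix below is a toy example, not data from the study:

```python
import numpy as np
from sklearn.manifold import MDS

# Toy symmetric dissimilarity matrix for 4 items (zero diagonal);
# items 0/1 and 2/3 form two close pairs that are far from each other.
D = np.array([[0., 1., 4., 3.],
              [1., 0., 3., 4.],
              [4., 3., 0., 1.],
              [3., 4., 1., 0.]])

# Metric MDS on precomputed dissimilarities, embedded in 2-D.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)   # one (x, y) position per item
```

The stress-minimization at the heart of MDS is an O(N²) computation over all pairs, which is why it becomes impractical for thousands of journals.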
Factor analysis is another method for generating measures of relatedness. In a mapping context, it is most often used to show factor memberships on maps created using either MDS or pathfinder network scaling, rather than as the sole basis for a map. Yet, factor values can be used directly for plotting positions. For instance, Leydesdorff [23] directly plotted factor values (based on citation counts) to distinguish between pairs of his 18 factors describing the SSCI journal set.
Layout routines capable of handling these large data sets include Pajek, [27] which has recently been used on data sets with several thousand journals by Leydesdorff, [22,23] and which is advertised to scale to millions of nodes; self-organizing maps, [28] which can scale, with various processing tricks, to millions of nodes, [29] and the bioinformatics algorithm LGL, [30] capable of dealing with hundreds of thousands of nodes, which uses an iterative layout as well as data types and algorithms from the Boost Graph Library. [31]
We chose to use VxOrd, [32] a force-directed graph layout algorithm, over the other algorithms mentioned, for several reasons. VxOrd improves on a traditional force-directed approach by employing barrier jumping to avoid trapping of clusters in local minima, and a density grid to model repulsive forces. Because of the repulsive grid, computation times are of order O(N) rather than O(N²), allowing VxOrd to be used on graphs with millions of nodes. VxOrd also applies edge cutting criteria, which leads to graph layouts exhibiting both local (orientation within groups) and global (group-to-group) structure. The combination of the initial node and edge structure and cutting criteria thus determine the number, size, shape, and position of natural groupings of nodes.
Validation of science maps is a difficult task. In the past, the primary method for validating such maps has been to compare them with the qualitative judgments made by experts, and has been done only for single-discipline-scale maps (see the background section of Klavans & Boyack [1] for more discussion).
A more pragmatic approach is to use the ISI journal classifications to evaluate the validity of the journal similarity measures and the corresponding maps. The ISI journal classification system, while it does have its critics, is based on expert judgment and is widely used. In principle, users would expect that pairs of journals with high similarity should be in the same ISI category. Journals in the same cluster of a journal mapping should have the same ISI category assignments. These assumptions are used to validate and compare the eight different similarity measures and corresponding graph layouts or maps.
In our previous work with the current data set, and the same eight similarity measures and maps from Figure 1, we investigated local accuracy and the effects on accuracy of reducing dimensionality with VxOrd [1] using the ISI category assignments as a reference basis. We found that, counterintuitively, use of the VxOrd algorithm to convert similarities to map positions actually increased local accuracy. We also found that four of the inter-citation measures had roughly comparable local accuracy at 95% journal coverage, and recommended the IC-Cosine measure as the best overall measure.
In this work we focus on structural accuracy or the validity of the global structure of the solution space. To make quantitative comparisons of our eight maps of science, we implement a mutual information method recently used to distinguish between gene clustering algorithms. [37] This mutual information method requires a reference basis, for which we use the ISI journal category assignments.
To employ the method of Gibbons and Roth [37] we need to do a clustering of each of the maps. VxOrd gives (x,y) coordinate positions for each node, but does not assign cluster numbers to the nodes. Thus, k-means clustering was applied to each of the maps in Figure 1. Other clustering methods (e.g. linkage or density-based clustering) could have been used.
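The mutual-information comparison against a reference categorization can be sketched as follows. The permutation-based Z-score shown here follows the spirit of the Gibbons and Roth approach (observed MI against MI of randomized cluster labels), but the permutation count, RNG seed, and toy labels are illustrative choices, not the paper's:

```python
import numpy as np
from collections import Counter

def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two discrete labelings of the same items."""
    n = len(labels_a)
    pa = Counter(labels_a)
    pb = Counter(labels_b)
    pab = Counter(zip(labels_a, labels_b))
    # MI = sum_ab p(a,b) * log( p(a,b) / (p(a) * p(b)) )
    return sum(
        (c / n) * np.log(c * n / (pa[a] * pb[b]))
        for (a, b), c in pab.items()
    )

def mi_z_score(clusters, categories, n_perm=200, seed=0):
    """Z-score of observed MI against a null of randomly permuted cluster labels."""
    rng = np.random.default_rng(seed)
    observed = mutual_information(clusters, categories)
    null = np.array([
        mutual_information(rng.permutation(clusters).tolist(), categories)
        for _ in range(n_perm)
    ])
    return (observed - null.mean()) / null.std()

# Toy example: three clusters perfectly aligned with three categories.
clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]
categories = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
z = mi_z_score(clusters, categories)
```

A clustering perfectly aligned with the reference categories, as in the toy labels above, yields an observed MI far above the permutation null and hence a large positive Z-score, which is the sense in which the Z-scores reported below measure distance from randomness.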
The CC-Raw map clearly performs the worst. The Z-scores for all other measures are near or above a value of 350, indicating that all of these measures give maps that are far from random. The IC-Pearson map gives the highest Z-score over nearly the entire range of cluster solutions. It is only at the higher end, from 200 through 250 clusters, that the IC-Jaccard map has a Z-score comparable to that of the IC-Pearson.
Hence, based on Z-scores it is likely that any of the six would be a suitable choice as the basis for an accurate map of science.
For a co-citation-based map, the CC-K50 measure is a clear winner for several reasons. Although the Z-score for the CC-K50 is nearly identical to that of the CC-Pearson, the K50 measure is scalable to much larger numbers of nodes, while the Pearson is a full N² calculation, and cannot easily scale much higher than the 7,000 nodes used here. The CC-K50 map is a visually well-balanced map with a good distribution of cluster sizes and positions (see Figures 1 and 3). By contrast, the CC-Pearson map appears very stringy; clusters are very dense with less visual differentiation between disciplines, and thus not as suitable for presentation. The CC-K50 map also has a higher degree of local accuracy. [1]
Of these three (IC-Cosine, IC-Jaccard, and IC-Pearson), we choose to further characterize the IC-Jaccard map as our best map due to its slightly higher Z-score, recognizing that the IC-Cosine map is in a virtual statistical dead heat and that the IC-Pearson map is only somewhat lower in local accuracy.
Differences such as these between the maps at the discipline level are likely due to fine-scaled differences between the co-citation and inter-citation patterns. Yet, the overall consistency between the co-citation and inter-citation-based maps of science suggests the general structure described here is robust.
Figure 6 shows the clear distinction between two main areas within the LIS discipline. Although there are relationships between journals in the two clusters, the dominant relationships (darkest edges) are within clusters. The journals in the cluster at the upper left all focus on libraries and librarians and their work, while those in the cluster at the lower right are all focused on advances in information science.
Eight different similarity measures were calculated from the combined SCI/SSCI data and the resulting journal-journal similarity matrices were mapped using VxOrd. The eight maps were then compared based on two different accuracy measures, the scalability of the similarity algorithm, and the readability of layouts (clustering).