Cognitive mapping of science makes it possible to visualise the structure of science. It was first applied in information services, was later found to be applicable to science policy and research evaluation as well, and is now increasingly used to identify emerging and converging fields and to improve subject delineation.
Approaches to cognitive mapping of science fall into three broad categories: methods based on citation information, methods based on text, and hybrid methods that combine both. This study uses a hybrid text/citation method to cluster the journals covered by the Web of Science database in 2002-2006 and uses the resulting cognitive map to examine the current journal-based subject-classification scheme and, where possible, to propose improvements.
The study first evaluates and visualises the 22-field subject-classification scheme of the Essential Science Indicators (ESI). The left and right panels of Fig. 1 show the Silhouette values of the 22 ESI fields measured with cross-citation and with textual information, respectively; they indicate that the journals in Biology and Biochemistry (#2), Clinical Medicine (#4), Engineering (#7), Plant and Animal Science (#19) and Social Sciences (#21) are not sufficiently coherent.
Descriptive terms for each field can be derived from the TF-IDF weights of terms, and these reveal considerable overlaps between several fields, for example between Engineering (#7) and Computer Science (#5), Chemistry (#3) and Materials Science (#11), Plant and Animal Science (#19) and Environment/Ecology (#8), as well as among Biology and Biochemistry (#2), Molecular Biology and Genetics (#14) and Clinical Medicine (#4). In addition, the terms describing the Social Sciences (#21) reflect the pronounced heterogeneity of that field.
Fig. 2 is a structural map of the 22 ESI fields drawn with Pajek; it likewise shows strong links between Biology and Biochemistry (#2) and Molecular Biology and Genetics (#14), Chemistry (#3) and Materials Science (#11), Computer Science (#5) and Engineering (#7), and Environment/Ecology (#8) and Plant and Animal Science (#19).
The roughly 8300 journals are then clustered using cosine similarities and Ward's agglomerative hierarchical clustering algorithm, and the clustering result is compared with the classification scheme. Notably, the textual information provides the labelling of the clusters, while the citation information yields the cross-citation graph used for visualisation and also serves as input to the PageRank algorithm for determining representative journals.
The number of clusters can be chosen according to the quality of the clustering, which can be assessed with internal or external validation measures. Internal validation considers only the statistical properties of the data and the clusters, for example the dendrogram, Silhouette values and modularity; external validation compares the clustering result with a known reference partition, for example by computing the Jaccard similarity between the two. Based on a visual inspection of the dendrogram, the journals are first divided into three clusters, then into seven, and finally into 22. The three large clusters roughly correspond to the natural and applied sciences (biology, agriculture and environmental sciences; physics, chemistry and engineering; mathematics and computer science), the medical sciences, and the social sciences and humanities. Judging from the TF-IDF terms, three of the seven clusters belong to the natural and applied sciences, two to the life sciences (biosciences and biomedical research; clinical, experimental medicine and neurosciences) and two to the social sciences and humanities (economics, business and political science; psychology, sociology and education).
Fig. 6 is a structural map of the 22 clusters. It shows the social-sciences and humanities clusters (#1, #6, #14 and #22, and #9, #11 and #21), geosciences, environmental science, biology and agriculture (#2, #15 and #19), physics, chemistry and engineering (#4, #20 and #5), mathematics and computer science (#8 and #18), biosciences and biomedical research (#3, #13 and #16), and clinical, experimental medicine and neurosciences (#7, #10, #12 and #17).
Table 3 compares the clustering quality of the 22 ESI fields with that of the 22 clusters produced by the citation-based, text-based and hybrid methods; the hybrid combination of citation and textual data performs best on almost all indicators.
Fig. 8 uses the Jaccard index to assess the concordance between the clustering result and the ESI scheme.
Finally, the study analyses journal 'migration', i.e. cases in which a journal is not assigned to the cluster corresponding to its original ESI field but ends up in a different cluster. 'Good migration' increases the consistency of the classification, i.e. the Silhouette value or the modularity increases. The study uses this phenomenon to propose, on the basis of the concordance between clusters and fields, improvements to the current journal-based subject-classification scheme.
A hybrid text/citation-based method is used to cluster journals covered by the Web of Science database in the period 2002–2006. The objective is to use this clustering to validate and, if possible, to improve existing journal-based subject-classification schemes.
In a first step, the 22-field subject-classification scheme of the Essential Science Indicators (ESI) is evaluated and visualised. In a second step, the hybrid clustering method is applied to classify the about 8300 journals meeting the selection criteria concerning continuity, size and impact.
Moreover, the textual component of the hybrid method allows labelling the clusters using cognitive characteristics, while the citation component allows visualising the cross-citation graph and determining representative journals suggested by the PageRank algorithm.
Finally, the analysis of journal ‘migration’ allows the improvement of existing classification schemes on the basis of the concordance between fields and clusters.
The history of cognitive mapping of science is as long as the history of computerised scientometrics itself. While the first visualisations of the structure of science were considered part of information services, i.e., an extension of scientific review literature (Garfield, 1975, 1988), bibliometricians soon recognised the potential value of structural science studies for science policy and research evaluation as well. At present, the identification of emerging and converging fields and the improvement of subject delineation are in the foreground.
The main bibliometric techniques are characterised by three major approaches, particularly the analysis of citation links (cross-citations, bibliographic coupling, co-citations), the lexical approach (text mining), and their combination.
For instance, clustering based on co-citation and bibliographic coupling has to cope with several severe methodological problems. This has been reported, among others, by Hicks (1987) in the context of co-citation analysis and by Janssens, Glänzel, and De Moor (2008) with regard to bibliographic coupling. One promising solution is to combine these techniques with other methods such as text mining (e.g., combined co-citation and word analysis: Braam, Moed, & Van Raan, 1991a; combination of coupling and co-word analysis: Small, 1998; hybrid coupling-lexical approach: Janssens, Glänzel, & De Moor, 2007; Janssens et al., 2008).
Jarneving (2005) proposed a combination of bibliometric structure–analytical techniques with statistical methods to generate and visualise subject coherent and meaningful clusters. His conclusions drawn from the comparison with ‘intellectual’ classification were rather sceptical.
Despite several limitations, which will be discussed further in the course of the present study, cognitive maps proved to be useful tools in visualising the structure of science and can be used to adjust existing subject-classification schemes even on a large scale, as we will demonstrate in the following.
The main objective of this study is to compare (hybrid) cluster techniques for cognitive mapping with traditional ‘intellectual’ subject-classifications schemes.
In an earlier study related to the current work (the pilot study of Glenisson, Glänzel, and Persson (2005), further extended and confirmed by Glenisson, Glänzel, Janssens et al. (2005)), full-text analysis and traditional bibliometric methods were serially combined to improve the efficiency of the individual methods. It was clear that clusters found through application of text mining provided additional information that could be used to extend and explain structures found by bibliometric methods, and vice versa. However, the integration was still limited to serial combination.
All textual content was indexed with the Jakarta Lucene platform (Hatcher & Gospodnetic, 2004) and encoded in the Vector Space Model using the TF-IDF weighting scheme reviewed by Baeza-Yates & Ribeiro-Neto (1999). Stop words were neglected during indexing and the Porter stemmer was applied to all remaining terms from titles, abstracts, and keyword fields. The resulting term-by-document matrix contained nine and a half million term dimensions (9,473,061), but by ignoring all tokens that occurred in one sole document, only 669,860 term dimensions were retained. Those ignored terms with document frequency equal to one are useless for clustering purposes.
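As a rough illustration of this indexing step (the paper used Jakarta Lucene; the sketch below substitutes scikit-learn and NLTK, and the toy documents are invented), terms occurring in only a single document can be dropped with min_df=2:

```python
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = PorterStemmer()

def tokenize(text):
    # naive whitespace tokenisation; stop words removed, Porter stemming applied
    return [stemmer.stem(t) for t in text.lower().split()
            if t not in ENGLISH_STOP_WORDS]

# toy stand-in for the concatenated title/abstract/keyword text of each paper
documents = [
    "protein folding dynamics in yeast cells",
    "folding pathways of membrane proteins",
    "graph clustering with modularity optimisation",
]

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False, min_df=2)
X = vectorizer.fit_transform(documents)   # sparse document-by-term TF-IDF matrix
```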
The dimensionality was further reduced from 669,860 term dimensions to 200 factors by Latent Semantic Indexing (LSI) (Berry, Dumais, & O'Brien, 1995; Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990), which is based on the Singular Value Decomposition (SVD).
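A minimal sketch of this reduction step with scikit-learn; the random sparse matrix stands in for the real term-by-document data, and 200 components follow the paper:

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# toy stand-in for the (papers x terms) TF-IDF matrix
X = sparse_random(1000, 5000, density=0.01, random_state=0, format="csr")

# LSI: truncated SVD keeps the 200 leading latent factors
lsi = TruncatedSVD(n_components=200, random_state=0)
X_lsi = lsi.fit_transform(X)   # shape (1000, 200): documents x factors
```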
Text-based similarities were calculated as the cosine of the angle between the vector representations of two papers (Salton & McGill, 1986).
For simplicity and efficiency, the method used to summarise the subject of a field or cluster is based on selecting the terms with the highest mean TF-IDF weights over all journal papers in the field or cluster, where the IDF factor is calculated on the complete term-by-paper matrix (more than six million papers).
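A hypothetical helper illustrating this labelling rule; the TF-IDF matrix, cluster labels and term list are assumed inputs (e.g. from a scikit-learn vectoriser), not the authors' actual code:

```python
import numpy as np

def top_terms(tfidf, cluster_labels, terms, cluster_id, k=10):
    """Return the k terms with the highest mean TF-IDF weight over all
    documents assigned to the given cluster (or field)."""
    rows = tfidf[np.asarray(cluster_labels) == cluster_id]   # documents of the cluster
    mean_weights = np.asarray(rows.mean(axis=0)).ravel()     # mean weight per term
    return [terms[i] for i in np.argsort(mean_weights)[::-1][:k]]
```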
For example, Treeratpituk and Callan (2006) automatically select and assign a few concise labels to hierarchical clusters by combining statistical features from the cluster, parent cluster, and a corpus of general English into a descriptive score.
Geraci, Maggini, Pellegrini, and Sebastiani (2008) label clusters by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure, and by looking within the titles of Web pages for the substring that best matches the selected top-scoring words.
The similarities Sij used for clustering were found by calculating the cosine of the angle between the pair of vectors containing all symmetric journal cross-citation values between the two respective journals (i and j) and all other journals (i.e., row or column of the matrix C).
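In other words, Sij is the cosine between rows i and j of the symmetric cross-citation matrix C. A small NumPy sketch with an invented 3x3 matrix:

```python
import numpy as np

# toy symmetric cross-citation matrix; C[i, j] is the number of citations
# exchanged between journals i and j (invented counts)
C = np.array([[0., 5., 1.],
              [5., 0., 2.],
              [1., 2., 0.]])

norms = np.linalg.norm(C, axis=1)
S = (C @ C.T) / np.outer(norms, norms)   # S[i, j] = cos(row i of C, row j of C)
```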
The journal cross-citation graph is also analysed to identify important high-impact journals. We use the PageRank algorithm (Brin & Page, 1998) to determine representative journals in each cluster. Besides, the graph can also be used to evaluate the quality of a clustering outcome.
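A sketch of the PageRank step using NetworkX; the journal names and citation counts below are invented, and in practice the ranking would be inspected per cluster:

```python
import networkx as nx

# directed, weighted citation graph: an edge A -> B with weight w means that
# journal A cites journal B w times (toy numbers)
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("Journal A", "Journal B", 120),
    ("Journal B", "Journal A", 80),
    ("Journal C", "Journal B", 40),
])

scores = nx.pagerank(G, weight="weight")
representative = max(scores, key=scores.get)   # highest-ranked journal
```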
In order to subdivide the journal set into clusters we used the agglomerative hierarchical cluster algorithm with Ward’s method (Jain & Dubes, 1988).
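A minimal SciPy sketch of this step, assuming a symmetric distance matrix derived from the hybrid similarities (e.g. 1 minus similarity); note that Ward's method strictly presupposes Euclidean distances, so this is only an approximation of the procedure, with toy data standing in for the journal distances:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
D = rng.random((50, 50))          # toy stand-in for the journal distance matrix
D = (D + D.T) / 2.0               # symmetrise
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D, checks=False), method="ward")   # agglomerative tree
labels = fcluster(Z, t=22, criterion="maxclust")           # cut into 22 clusters
```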
In general, the number of clusters is determined by comparing the quality of different clustering solutions based on various numbers of clusters. Cluster quality can be assessed by internal or external validation measures. Internal validation solely considers the statistical properties of the data and clusters, whereas external validation compares the clustering result to a known gold standard partition.
This compound strategy encompasses observation of a dendrogram, text- and citation-based mean Silhouette curves, and modularity curves. Besides, the Jaccard similarity coefficient is used to compare the obtained results with an intellectual classification scheme.
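For instance, the mean Silhouette value of a given partition can be obtained from a precomputed distance matrix with scikit-learn; the toy distances and labels below are invented stand-ins:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
D = rng.random((50, 50))
D = (D + D.T) / 2.0
np.fill_diagonal(D, 0.0)                  # toy symmetric distance matrix
labels = rng.integers(0, 5, size=50)      # toy cluster assignment

# mean Silhouette value over all items; values near 1 indicate items that sit
# well inside their cluster, values near -1 indicate likely misassignment
mean_silhouette = silhouette_score(D, labels, metric="precomputed")
```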
Up to a multiplicative constant, modularity measures the number of intra-cluster citations minus the expected number in an equivalent network with the same clusters but with citations given at random. Intuitively, in a good clustering there are more citations within (and fewer citations between) clusters than could be expected from random citing.
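This verbal description corresponds to the usual Newman-Girvan formula; a sketch for a symmetric cross-citation matrix and a cluster assignment (both hypothetical inputs):

```python
import numpy as np

def modularity(C, labels):
    """Q = sum over clusters c of (e_c / m - (a_c / m)**2), where e_c is the
    citation weight inside cluster c, a_c the total citation weight attached
    to c and m the grand total; C is assumed symmetric."""
    labels = np.asarray(labels)
    m = C.sum()
    q = 0.0
    for c in np.unique(labels):
        idx = labels == c
        e_c = C[np.ix_(idx, idx)].sum()   # intra-cluster citations
        a_c = C[idx, :].sum()             # all citations involving cluster c
        q += e_c / m - (a_c / m) ** 2
    return q
```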
In Fig. 8, we use the Jaccard index to compare each cluster with every field from the intellectual ESI classification, in order to detect the best-matching fields for each cluster.
Nowadays two ISI systems are widely used, in particular, the ISI Subject Categories, which are available in the JCR and through journal assignment in the Web of Science as well, and the Essential Science Indicators (ESI).
While the first system assigns multiple categories to each journal and is too fine-grained (254 categories) for comparison with cluster analysis, the ESI scheme forms a partition (with practically unique journal assignment) and its 22 fields are large enough. ... This subject-classification scheme is in principle based on unique assignment; only about 0.6% of all journals were assigned to more than one field over a 5-year period.
Fig. 1 presents the evaluation of the 22 ESI fields based on the cross-citation- (left) and text-based (right) Silhouette values (see Section 3.3.3). Since the ESI fields form a partition, this approach makes it possible to evaluate their consistency as if the fields were the results of a clustering procedure. Multi-, inter- and cross-disciplinarity of journals can certainly affect the results.
Several fields seem not to be coherent enough from both perspectives (i.e., the cross-citation and textual approach). Above all, the Silhouette values of field #2 (Biology and Biochemistry), #4 (Clinical Medicine), #7 (Engineering), #19 (Plant and Animal Science) and #21 (Social Sciences) substantiate that at least five of the 22 fields are not sufficiently coherent.
Simultaneously to the above validation, the textual approach also provides the best TF-IDF terms – out of a vocabulary of 669,860 terms – describing the individual fields. These terms are presented in Table 2. Although these terms already provide an acceptable characterisation of the topics covered by the 22 fields, considerable overlaps are apparent between pairs of fields, respectively: Engineering (#7) and Computer Science (#5), Chemistry (#3) and Materials Science (#11), Plant and Animal Science (#19) and Environment/Ecology (#8), as well as Biology and Biochemistry (#2), Molecular Biology and Genetics (#14) and Clinical Medicine (#4). In addition, the terms characterising the social sciences (#21) reflect a pronounced heterogeneity of the field.
The structural map of the 22 ESI fields based on cross-citation links is presented in Fig. 2. For the visualisation we used Pajek (Batagelj & Mrvar, 2003). The network map confirms the strong links we have found based on the best terms between fields #2 and #14, #3 and #11, #5 and #7, and #8 and #19, respectively.
In Table 3 we compare the quality of the partition of 22 ESI fields with the quality of the 22 clusters resulting from citation-based, text-based and hybrid clustering.
The cluster dendrogram shows the structure in a hierarchical order (see Fig. 4). We visually find a first clear cut-off point at three clusters, a second one around seven, and 22 also seems an acceptable number of clusters.
A solution with only three clusters results in an almost trivial classification. Intuitively, these three high-level clusters should comprise natural and applied sciences, medical sciences, and social sciences and humanities.
The solution comprising seven clusters results in a non-trivial classification. The best TF-IDF terms (see Table 5) show that three of these clusters represent the natural/applied sciences, whereas two classes each stand for the life sciences and the social sciences and humanities. This situation is also reflected by the cluster dendrogram in Fig. 4. A closer look at the best TF-IDF terms reveals that the social-sciences cluster (#1 of the 3-cluster solution) is split into cluster #1 (economics, business and political science) and #6 (psychology, sociology, education), the life-science cluster (#3 in the 3-cluster scheme) is split into clusters #3 (biosciences and biomedical research) and #7 (clinical, experimental medicine and neurosciences) and, finally, the sciences cluster #2 of the 3-cluster scheme is distributed over three clusters in the 7-cluster solution, particularly, the cluster comprising biology, agriculture and environmental sciences (#2), physics, chemistry and engineering (#4) as well as mathematics and computer science (#5).
The social-sciences and humanities clusters form two groups that are each strongly interlinked; one consists of clusters #1, #6, #14 and #22 with focus on humanities, economics, business, political and library science, the other one comprises #9, #11 and #21 with sociology, education and psychology. This is in line with the hierarchical structure shown in Fig. 4. These two groups correspond to the two social-sciences clusters in the 7-cluster solution (cf. Section 4.4).
On the basis of the most important TF-IDF terms (see Table 6) we can assign clusters #2, #15 and #19 to geosciences, environmental science, biology and agriculture, which, in turn, form a larger group corresponding to the first of the three ‘‘megaclusters” in the 7-cluster solution.
The remaining science clusters form two further groups: #4, #20 and #5 form a group covering chemistry, physics and engineering, while #8 and #18 form the third group, comprising mathematics and computer science.
Here we have a biomedical and a clinical group. These two groups are in line with the hierarchical structure of the dendrogram in Fig. 4 but less clearly distinguished in the graphical network presentation (Fig. 6). Nonetheless, the terms provide an excellent description for at least some of the medical clusters: cluster #7 stands for the neuro- and behavioral sciences, #3 for bioscience, #10 for clinical and social medicine, #13 for microbiology and veterinary science, #12 for non-internal medicine, #16 for hematology and oncology and #17 for cardiovascular and respiratory medicine. According to the dendrogram, clusters #3, #13 and #16 and clusters #7, #10, #12 and #17 each form one larger cluster. On the basis of the best terms, we can characterise these groups as the bioscience–biomedical and the clinical and neuroscience group, respectively.
In this subsection we compare the structure resulting from the hybrid clustering with the ESI subject classification. This comparison is based on the centroids of the clusters and fields. The centroid of a cluster or field is defined as the linear combination of all documents in it and is thus a vector in the same vector space. For each cluster and for each field, the centroid was calculated and the MDS of pairwise distances between all centroids is shown in Fig. 7.
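A sketch of how such a map could be reproduced, assuming document vectors (e.g. the 200-factor LSI representations) and a cluster assignment; the same would be done for the ESI fields, and the toy data below is invented:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X_lsi = rng.normal(size=(500, 200))      # toy document vectors
labels = rng.integers(0, 22, size=500)   # toy cluster assignment

# centroid of each cluster: here simply the mean of its document vectors
centroids = np.vstack([X_lsi[labels == c].mean(axis=0)
                       for c in np.unique(labels)])

# multidimensional scaling of the pairwise centroid distances into 2-D
coords = MDS(n_components=2, random_state=0).fit_transform(centroids)
```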
In Fig. 8, we use the Jaccard index to determine the concordance between our clustering solution and the ESI Scheme by comparing each cluster with every field, in order to detect the best-matching fields for each cluster. The darker a cell in the matrix, the higher the Jaccard index, and hence the more pronounced the overlap between the corresponding cluster and ESI field.
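The Jaccard index between a cluster and a field is simply the ratio of shared journals to the union of their journal sets; a hypothetical helper:

```python
def jaccard(cluster_journals, field_journals):
    """Jaccard index between a cluster and an ESI field, each given as a set
    of journal names (cf. the matrix shown in Fig. 8)."""
    a, b = set(cluster_journals), set(field_journals)
    return len(a & b) / len(a | b) if (a or b) else 0.0
```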
If clustering algorithms are adjusted or changed, one can observe the following phenomenon. Some units of analysis are leaving clusters they formerly belonged to and end up in different clusters. This phenomenon is called ‘migration’. We can distinguish between ‘good migration’ and ‘bad migration’.
‘Good migration’ is observed if the goodness of the unit’s classification improves, otherwise we speak about ‘bad migration’. We can also apply this notion of migration to the comparison of clustering results with any reference classification. In the following we will use the ESI scheme as reference classification.
Out of the 8305 journals under study, more than one third, namely 3204 journals, were not assigned to the cluster which best matches their ESI field. As already mentioned above, we call these journals ‘migrated journals’.
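A hypothetical sketch of how such migrated journals can be listed, given the cluster assignment, the ESI field of each journal and the best-matching cluster of each field (taken, for instance, from the Jaccard matrix above); all argument names are invented:

```python
def find_migrated_journals(cluster_of, esi_field_of, best_matching_cluster):
    """Journals whose cluster differs from the cluster that best matches
    their ESI field; all three arguments are dicts keyed as indicated."""
    return [journal for journal, cluster in cluster_of.items()
            if cluster != best_matching_cluster[esi_field_of[journal]]]
```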
‘Good migrations’ are observed if journals improved their Silhouette values after migration. Based on their titles and scopes (not shown), apparently they should indeed be assigned to the cluster to which they have moved.
Although the Silhouette and modularity values substantiate a more coherent structure of the hybrid clustering as compared with the ESI subject scheme, not all clusters are of high quality. Problems have been found, for instance, in clusters #1 and #12 where interdisciplinarity and strong links with other clusters distort the intra-cluster coherence.