information visualization
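This study uses h-similarity (Schubert 2010) to measure the similarity between journals and clusters the journals with the Walktrap Community Finding (WCF) algorithm (Pons and Latapy 2005). Because the resulting clusters can represent the structure of scientific fields, they are compared with the subject categories of the Essential Science Indicators (ESI) database.
h-similarity is computed from each journal's citation set. Borrowing the idea of the h-index, the h-core of a journal consists of the top h journals in its ranked citation list, each with at least h citations. The similarity of two journals is estimated as the Jaccard index of their h-cores, a value between 0 and 1. When the h-cores of two journals are completely different, their h-similarity is 0; the more journals the two h-cores have in common, the higher the h-similarity. The WCF algorithm used in this study (Pons and Latapy 2005) is a community detection algorithm combined with modularity optimization (Girvan and Newman 2002; Newman and Girvan 2004; Newman 2006), so that the links within each cluster are both dense and strong.
Applying this similarity measure and algorithm to the nearly 6000 journals in the 2006 SCI JCR database yielded 61 clusters (called h-clusters). Cluster sizes vary widely: the two largest h-clusters contain 1214 and 904 journals, and their subjects are biotechnology and clinical medicine, respectively. The study uses the ESI subject categories to examine which categories the journals in each h-cluster belong to, and measures the betweenness centrality of the journals within each cluster, taking the journal with the highest betweenness centrality as the representative of that h-cluster. The results show that the journals in an h-cluster mostly belong to the same subject category. Viewed from the other direction, when several h-clusters fall within the same subject category, that category is usually a broad one, such as engineering. In addition, the study uses the two classification systems as journal features in a correspondence analysis, in order to visualize the subject categories and the h-clusters; related categories are mapped to neighboring regions.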
Journals covered by the 2006 Science Citation Index Journal Citation Reports database have been subjected to a clustering procedure utilizing h-similarity as the underlying similarity measure.
In this paper an attempt is made to use this h-similarity measure for clustering science journals, and thereby to gain a structural map of science fields.
The h-similarity between journals A and B has been defined through their ranked cited journal list. The h-core of a ranked cited journal list is the largest top h-element list with each item having at least h citations. Then we can define the h-similarity of two journals as
$$h_{A,B} = \frac{|H_A \cap H_B|}{|H_A \cup H_B|}$$
where $h_{A,B}$ is the h-similarity between journals A and B, the sets $H_A$ and $H_B$ are the h-cores of A and B, respectively, $\cap$ denotes the intersection, $\cup$ denotes the union and $|\cdot|$ denotes the cardinality of sets.
The h-similarity is a Jaccard-type similarity measure with a range [0, 1] that has value 0 if and only if the two h-cores have no common elements and has value 1 if and only if the two h-cores contain identical elements in whatever order.
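The definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the journal names and citation counts are invented, and ties at the h-core boundary are broken alphabetically, a choice the excerpt does not specify.

```python
def h_core(citations):
    """Return the h-core: the largest top-h list whose items each have >= h citations.

    `citations` maps a journal name to a citation count (hypothetical input format).
    """
    # Rank by count, descending; ties are broken by name (an assumption).
    ranked = sorted(citations.items(), key=lambda kv: (-kv[1], kv[0]))
    h = 0
    for rank, (_, count) in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return {name for name, _ in ranked[:h]}


def h_similarity(core_a, core_b):
    """Jaccard index of two h-cores; taken to be 0.0 when both are empty."""
    union = core_a | core_b
    if not union:
        return 0.0
    return len(core_a & core_b) / len(union)


# Invented citation lists for two journals A and B.
journal_a = {"J1": 10, "J2": 5, "J3": 3, "J4": 1}
journal_b = {"J2": 8, "J3": 2, "J5": 2, "J6": 1}

core_a = h_core(journal_a)   # h = 3
core_b = h_core(journal_b)   # h = 2
similarity = h_similarity(core_a, core_b)
print(sorted(core_a), sorted(core_b), similarity)
```

In this toy case the two h-cores share two journals out of three distinct ones, so the h-similarity is 2/3.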
The set of journals covered by the 2006 Science Citation Index Journal Citation Reports (SCI JCR 2006) database, with approximately 6000 elements, was subjected to a clustering procedure (or, more precisely, a community detection exercise) utilizing h-similarity as the underlying similarity measure.
Beyond classification of journals into fields and subfields, the auxiliary aim of the experiment was to explore the relationship of this novel classification to existing ones, of which the most straightforward and first example is the system of subject categories used by the Essential Science Indicators (ESI) database of Thomson–Reuters (formerly, ISI).
The ESI system (unlike several other classification schemes used in Thomson–Reuters databases) does not allow multiple classification, so each journal is assigned to exactly one of the 22 categories.
Instead of a direct comparison of h-clusters and ESI-fields, we attempted to characterize the emerging clusters in terms of ESI categories and vice versa, providing also an interpretation of them in terms of each other.
As the first step, we computed the pairwise h-similarities of the 6000 titles. The resulting proximity matrix, or, rather, the list of weighted journal pairs was conceived as the edgelist of a similarity graph, i.e. a weighted graph expressing the similarity pattern of the domain.
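As a rough sketch of this step, the weighted edge list can be built by looping over all journal pairs. The h-cores below are invented, and dropping zero-similarity pairs is an assumption made here to keep the list sparse; the excerpt does not say how such pairs were handled.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_edgelist(h_cores):
    """Return [(journal_a, journal_b, weight), ...] for every pair with nonzero similarity."""
    edges = []
    for a, b in combinations(sorted(h_cores), 2):
        weight = jaccard(h_cores[a], h_cores[b])
        if weight > 0:  # assumption: pairs with no shared h-core members get no edge
            edges.append((a, b, weight))
    return edges


# Hypothetical h-cores for three journals.
h_cores = {
    "A": {"x", "y", "z"},
    "B": {"y", "z"},
    "C": {"q"},
}
edges = similarity_edgelist(h_cores)
print(edges)
```

For roughly 6000 titles this loop visits about 18 million pairs, which is feasible but slow in pure Python.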
The basic idea w.r.t. partitioning the domain to journal sets representing fields and subfields was then to find subgraphs (communities) in this large network of journals that are (1) dense (in the sense that most members are similar to most other ones) and (2) strongly connected (meaning a high sum of h-similarity values).
To this end, we chose a community detection method that takes into account the edge weights (h-similarity values) of the graph. We used the Walktrap Community Finding (WCF) algorithm (Pons and Latapy 2005) as implemented in the igraph R package (R Development Core Team 2009) by Pons and Csardi (no date) and Csardi and Nepusz (2006), which attempts to find dense subgraphs by random walks.
The WCF algorithm works in an agglomerative fashion, starting with the strongest communities and merging the closest ones in consecutive steps until the whole network is reconstructed. This procedure allowed us to select particular levels of agglomeration, yielding a hierarchical classification, according to an optimization criterion. For optimization we used the modularity function of Newman and Girvan (Girvan and Newman 2002; Newman and Girvan 2004; Newman 2006).
Using this measure as the objective function to be maximized, we selected the community structure of highest modularity as the level of aggregation to be evaluated and compared to the ESI system.
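The formula itself does not appear in these notes. For orientation only, a commonly used weighted form of the Newman-Girvan modularity is given below; the notation is an assumption and may differ from the one in the paper.

```latex
Q = \frac{1}{2m} \sum_{i,j} \left[ w_{ij} - \frac{s_i s_j}{2m} \right] \delta(c_i, c_j)
```

Here $w_{ij}$ is the weight of the edge between journals $i$ and $j$ (the h-similarity), $s_i = \sum_j w_{ij}$ is the strength of journal $i$, $m = \frac{1}{2}\sum_{i,j} w_{ij}$ is the total edge weight, $c_i$ is the community of journal $i$, and $\delta$ equals 1 when its two arguments coincide and 0 otherwise.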
The value of the modularity function reached its single maximum, after a slow and gradual increase and before a sudden fall, at a cluster structure with 61 clusters.
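The selection of a level by maximum modularity can be illustrated with a small pure-Python sketch. It is not the igraph implementation used in the study: the graph is a toy example, and the candidate partitions are written by hand to stand in for levels of the WCF dendrogram.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Weighted Newman-Girvan modularity of a partition.

    edges: (u, v, weight) triples of an undirected graph without self-loops.
    membership: dict mapping each node to a community label.
    """
    total = 0.0                     # sum of all edge weights
    internal = defaultdict(float)   # edge weight inside each community
    strength = defaultdict(float)   # summed node strengths per community
    for u, v, w in edges:
        total += w
        strength[membership[u]] += w
        strength[membership[v]] += w
        if membership[u] == membership[v]:
            internal[membership[u]] += w
    if total == 0:
        return 0.0
    return sum(internal[c] / total - (strength[c] / (2 * total)) ** 2
               for c in strength)


# Toy graph: two triangles joined by one bridge, all weights 1.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]

# Hypothetical dendrogram levels, keyed by number of clusters.
levels = {
    6: {n: n for n in range(6)},                 # every node alone
    2: {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"},
    1: {n: "all" for n in range(6)},             # everything merged
}

scores = {k: modularity(edges, part) for k, part in levels.items()}
best_level = max(scores, key=scores.get)
print(scores, best_level)
```

The two-triangle split scores highest, which mirrors how the 61-cluster level was picked as the maximum of the modularity curve.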
Due to the agglomerative nature of the process, it was the most inclusive level of aggregation, representing the general fields within this context.
The size distribution of the 61-field solution (Fig. 1) is strongly asymmetric and skewed: there are two "giant" fields (with 1214 and 904 titles), a few extensive fields (between 200 and 400 journals) and many middle-sized or small categories (with fewer than 200 journals).
In the first cluster (n = 1214) clinical medicine, biology and biochemistry, molecular biology/genetics and pharmacology/toxicology jointly account for more than 60% of the content, on the basis of which it can be considered a fairly coherent field of biotechnology.
The second cluster has an even more explicit orientation, since it is clearly dominated by clinical medicine (with more than 80% share among ESI areas).
In general, we can state that, in all cases, h-clusters emerged as rather coherent fields in terms of ESI categories.
Though cluster profiles showed somewhat varying distributions, from one heavily contributing ESI field (like in the case above) to a couple of characteristic categories, the top contributors, also covering the majority of the journals included, consistently identified the meaning of each cluster.
The exercise showed that, from the perspective of h-similarity, ESI fields are quite broad or diverse categories in many cases.
To quantify the h-similarity structure of ESI categories, we used the well-known Shannon index, a readily interpretable measure of where a category lies on the uniformity–heterogeneity scale: the more heterogeneous a category is, the higher the value of the Shannon index (also called entropy or diversity, depending on the context of application).
Engineering is the top-rated, that is, the most diverse ESI field; computer science, social sciences in general, and economics and business complete the top four. At the other extreme, microbiology and immunology behave as monolithic fields; in general, the life sciences and chemistry tend to have values below the average diversity.
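A minimal sketch of such a diversity calculation follows. It assumes the index is taken over the distribution of an ESI field's journals across h-clusters and uses the natural logarithm; neither detail is stated in the excerpt, and the assignments are invented.

```python
import math
from collections import Counter, defaultdict


def shannon_index(counts):
    """Shannon entropy (natural log) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Invented (ESI field, h-cluster) assignments, one pair per journal.
assignments = [
    ("Engineering", "c1"), ("Engineering", "c2"),
    ("Engineering", "c3"), ("Engineering", "c4"),
    ("Microbiology", "c5"), ("Microbiology", "c5"),
    ("Microbiology", "c5"), ("Microbiology", "c5"),
]

# Cross-tabulate: for each ESI field, count its journals per h-cluster.
profiles = defaultdict(Counter)
for field, cluster in assignments:
    profiles[field][cluster] += 1

diversity = {field: shannon_index(list(profile.values()))
             for field, profile in profiles.items()}
print(diversity)
```

A field spread evenly over four clusters reaches ln 4, while a field contained in a single cluster scores 0, matching the contrast between diverse and monolithic fields.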
In order to capture all of the relations discussed so far in an expressive manner, we conducted a correspondence analysis (CA) upon the cross-tabulation of h-clusters and ESI fields as journal features or variables. In this way, by modelling h-clusters as ESI-profile points and vice versa, CA makes it possible to explore (1) the position of h-clusters relative to each other as determined by the ESI contributions, (2) the relative position of ESI fields based upon h-fields and also—with some limitations—(3) the relation between the two partitions.
It is most striking from the results, as shown in Fig. 3, that field-points in both taxonomies are clearly separated and organized into well-readable regions: the bottom right quadrant constitutes the biomedical life sciences, the upper right quadrant is for the environmental sciences. There is a relatively standalone point for chemistry, which is in accord with our previous observations concerning the field. On the left side, along the horizontal reference line lie the physical and engineering sciences (both applied and general), and the bottom left quadrant provides space for traditionally mathematically oriented social sciences (economics and business). In this corner, as a distant point, we can also find mathematics. (Point sizes are proportional to the mass of the corresponding point/field.)
In order to refine the interpretation of h-fields, the core journals within each subgraph of titles have been identified. The core journal of an h-field in the present context was understood as a representative member of the class, to which the majority of cluster members are similar, that is, which bears the most extensive similarity-based kinship in the cluster. We called these titles the "prototypes" of the h-fields.
The procedure of tracing prototypes, then, consisted of the following steps.
(1) In the first step, to clarify and sharpen similarity patterns, we omitted from each cluster the edges with a weight (similarity) value below a selected threshold. This cutoff parameter was set to 0.2.
(2) In these clarified clusters betweenness centrality scores were calculated for each element, and the title with maximum score (in the unambiguous case) was qualified as the core journal.
(3) For each h-cluster, sub-clusters were identified utilizing the hierarchical nature of our classification procedure. In particular, we picked a second (lower) level of the hierarchy based on the distribution of modularity values that yielded approximately 200 clusters. Steps (1)–(2) were repeated at this level, resulting in a core journal for each of the 200 clusters. Sub-clusters covered by h-clusters, then, provided a set of further representative titles for each h-cluster at a more detailed level of the classification.
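Steps (1) and (2) can be sketched for a single cluster as follows. The cluster is invented, and the betweenness here ignores edge weights once the 0.2 cutoff has been applied, which is an assumption; the excerpt does not say whether weighted shortest paths were used.

```python
from collections import defaultdict, deque


def threshold_edges(edges, cutoff=0.2):
    """Step (1): drop edges whose similarity is below the cutoff."""
    return [(u, v, w) for u, v, w in edges if w >= cutoff]


def betweenness(nodes, edges):
    """Step (2): unweighted betweenness centrality (Brandes' algorithm), undirected."""
    adj = defaultdict(set)
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    score = {n: 0.0 for n in nodes}
    for s in nodes:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {n: 0 for n in nodes}
        dist = {n: -1 for n in nodes}
        preds = {n: [] for n in nodes}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, farthest nodes first.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {n: b / 2 for n, b in score.items()}


# Invented cluster of five journals with h-similarity weights.
nodes = ["Hub", "A", "B", "C", "D"]
edges = [("Hub", "A", 0.5), ("Hub", "B", 0.4), ("Hub", "C", 0.3),
         ("A", "B", 0.25), ("C", "D", 0.35), ("A", "D", 0.1)]

kept = threshold_edges(edges, cutoff=0.2)   # the A-D edge (0.1) is removed
scores = betweenness(nodes, kept)
# max() returns the first maximum; ties (the ambiguous case) are not handled here.
prototype = max(scores, key=scores.get)
print(scores, prototype)
```

Step (3) would repeat the same two calls on each sub-cluster of the lower hierarchy level.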