information visualization
This study examines approaches to measuring similarity between documents: directly comparing the similarity of the citation distributions of two objects, versus comparing the two objects through their relations to third objects, referred to as first-order similarity and second-order similarity, respectively. Using the citation data of 58,885 documents from AIM, first-order similarities between documents are computed with a bibliographic-coupling-like approach, and the resulting first-order similarity matrix is then used to compute second-order similarities. To compare the two measures, a document similarity network is constructed, in which each document corresponds to a node and the values of the two similarity measures serve as link weights; the nodes of the network are then clustered. The Generalized Jensen-Shannon divergence (GJSD) (Lin, 1991) is used to measure the homogeneity of the term distributions of the documents within each cluster, thereby evaluating the quality of the clustering results. At every resolution level (number of clusters), the second-order results are significantly better than the first-order results. The study attributes this to two factors. First, second-order similarity can handle the case where two documents have no direct similarity but are each similar to a third document; as a result, links between mutually similar documents carry larger weights in the similarity network, which yields higher transitivity, a larger clustering coefficient, and easier detection of document clusters. Second, second-order similarity draws on a larger amount of data when measuring the similarity between documents, reducing statistical uncertainty.
In this study, we have dealt with the issue of similarity order regarding the measurement of document–document similarity. We used a large dataset of 58,885 articles from AIM, and experimentally compared first-order (FO) similarity with second-order (SO) similarity with respect to the overall quality, in terms of textual coherence, of partitions of the dataset. The partitions were obtained by optimizing weighted modularity.
Basically, there are two approaches to the measurement of similarity between two objects: the local (direct) approach and the global (indirect) approach (Ahlgren et al. 2003; van Eck and Waltman 2009).
The local approach focuses on the direct similarity between the two objects, while the global one focuses on the way the two objects relate to other objects in the dataset under study.
In the global approach, the similarity between two objects is obtained by measuring the similarity between their profiles: vectors that often contain the number of co-occurrences (possibly normalized) of an object with each other considered object. However, the components of such vectors might refer to entities other than co-occurrences.
A number of papers in the scientometric literature report outcomes of comparisons of similarity measures or similarity approaches. Outcomes of empirical comparisons are reported by Boyack et al. (2005, 2011), Boyack and Klavans (2010), Ahlgren and Colliander (2009a, b), Ahlgren and Jarneving (2008), Leydesdorff (2008), Klavans and Boyack (2006), Gmür (2003), Luukkonen et al. (1993), and Peters and Van Raan (1993). Other studies report outcomes of theoretical comparisons (Egghe 2009, 2010a, b; Egghe and Rousseau 2006; Hamers et al. 1989). In Van Eck and Waltman (2009), and in Egghe and Leydesdorff (2009), outcomes of both empirical and theoretical comparisons are reported.
The issue of comparison of similarity measures is discussed at length by Schneider and Borlund (2007a, b), though in an author cocitation analysis context. These authors remark that the transformation of a symmetric proximity matrix X into a corresponding matrix Y, where the values have been derived from pairs of columns (or rows), i.e., profiles, in X, is unconventional from the point of view of traditional approaches to multivariate analysis (Schneider and Borlund 2007a). The authors point to the issue of what to put on the diagonal of X. Van Eck and Waltman (2009) agree that the indicated approach is unconventional, but do not believe that the approach has any fundamental statistical problems, and give examples of global similarity measures with known good properties.
For an article in the resulting set to be included in the study, the article should have (a) an abstract, and (b) at least five cited references. The number of articles that satisfy conditions (a) and (b) is 58,885, and these 58,885 articles are included in the study.
We use the bibliographic coupling approach for the measurement of document–document similarity. It has recently been suggested that among the citation-based approaches—bibliographic coupling, cocitation and direct citation—bibliographic coupling tends to give the most accurate partitions in terms of textual coherence (Boyack and Klavans 2010).
With the updated list at our disposal, we constructed a reference-by-article matrix, where the rows correspond to the obtained citation keys. However, it was decided to weight the references. We believe it is reasonable to weight a reference in accordance with how frequently the reference occurs in the document collection under study. A reference that occurs in a large proportion of the documents in the collection, like a paper treating a method with a large application area, should normally be a bad indicator of similarity between two documents in which it occurs. We used the inverse document frequency (idf) approach (Salton and Buckley 1988), employed for term weighting in information retrieval.
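The idf weighting described above can be sketched as follows. The function name and the toy reference-by-article matrix are illustrative assumptions, and the weighting uses the common log(N/df) form of idf; the paper's exact weighting formula may differ in detail.

```python
import numpy as np

def idf_weight(ref_by_art):
    """Weight each cited reference (row) by its inverse document
    frequency, log(N / df), where N is the number of articles and
    df is the number of articles citing the reference."""
    N = ref_by_art.shape[1]
    df = (ref_by_art > 0).sum(axis=1)   # number of articles citing each reference
    idf = np.log(N / df)                # frequently cited references get low weight
    return ref_by_art * idf[:, np.newaxis]

# Toy example: 3 references (rows) x 4 articles (columns).
# Reference 0 is cited by every article, so its idf is log(4/4) = 0.
A = np.array([[1, 1, 1, 1],
              [1, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
A_weighted = idf_weight(A)
```

A reference cited by all articles receives weight zero, reflecting the claim above that ubiquitous references are poor indicators of similarity between two documents.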
Let A be the constructed (non-binary) reference-by-article matrix. From A, a 58,885 times 58,885 similarity matrix, B, was derived, populated with first-order similarity values as defined in (Ahlgren and Colliander 2009a):
From B, a second 58,885 times 58,885 similarity matrix, C, was derived, populated with second-order similarity values as defined in (Ahlgren and Colliander 2009a):
It has been argued that in the context of the local approach to similarity, when the initial matrix (for example, a reference-by-article matrix) has binary data, the association strength, rather than the cosine, is a suitable choice (van Eck and Waltman 2009). However, the reference-by-article matrix, A, is not binary due to the idf weighting of cited references, so we apply the cosine in both cases.
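Given that the cosine is used for both orders, the two similarity matrices can be sketched in a few lines. This is a minimal illustration on a hypothetical toy matrix, not a reproduction of the authors' exact definitions; in particular, the diagonal of B is kept as-is here, although the treatment of the diagonal is a known issue (Schneider and Borlund 2007a).

```python
import numpy as np

def cosine_matrix(M):
    """Pairwise cosine similarity between the columns of M."""
    norms = np.linalg.norm(M, axis=0)
    norms[norms == 0] = 1.0             # guard against all-zero columns
    Mn = M / norms
    return Mn.T @ Mn

# Toy reference-by-article matrix: doc0 cites {r0}, doc1 cites {r0, r1},
# doc2 cites {r1}; doc0 and doc2 share no references.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])

B = cosine_matrix(A)   # first-order similarities (matrix B in the text)
C = cosine_matrix(B)   # second-order similarities (matrix C in the text)
```

In this toy case B[0, 2] is zero, since doc0 and doc2 share no references, while C[0, 2] is positive, because both documents are directly similar to doc1; this is exactly the effect discussed in the concluding sections.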
Based on optimizing weighted modularity (Q) (Newman 2004), partitions corresponding to different values of K, the resolution level (i.e., the number of clusters in a partition), were obtained. Assume a network, where the nodes have been partitioned into K clusters, and let Ci be the cluster to which node i has been assigned. Q is expressed in terms of the weighted adjacency matrix, Aij.
The modularity of a given partition is the probability that links fall within clusters in the network minus the expected probability in an equivalent network, where the equivalent network has the same number of nodes, and links placed at random while preserving the strength of the nodes (i.e., preserving wi, for each node i).
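The verbal definition above corresponds to the standard weighted-modularity expression (Newman 2004); the paper's own equation (5) may differ in notation, but a common way to write it is:

```latex
Q = \frac{1}{2W} \sum_{ij} \left( A_{ij} - \frac{w_i w_j}{2W} \right) \delta(C_i, C_j),
\qquad w_i = \sum_j A_{ij}, \quad 2W = \sum_{ij} A_{ij},
```

where \(\delta(C_i, C_j) = 1\) if nodes \(i\) and \(j\) belong to the same cluster and 0 otherwise. The first term is the observed within-cluster link weight, and the second is its expectation under random placement preserving the node strengths \(w_i\).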
It has been shown mathematically that the optimization of (5) has a resolution limit (Fortunato and Barthelemy 2007). A concern, then, is the possible existence of undetectable subclusters within the clusters generated by optimizing (5). Moreover, we want to study the performance of the two similarity order approaches not relative to exactly one partition per approach, but at several different resolution levels, since we are interested in whether the potential performance differences depend on resolution level. It is clear, then, that a best-number-of-clusters approach is not feasible in our case.
We decided to use a multiple resolution method put forward by Arenas et al. (2008). ... By optimizing (6) for different values of r, partitions at different resolution levels are obtained.
By optimizing (6) with a heuristic described in (Blondel et al. 2008), using different values of r, we arrived at a range of partitions, from fine-grained (K = 5000) to coarse (K = 500).
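The Arenas et al. (2008) method rescales the resolution by adding a self-loop of weight r to every node and optimizing ordinary modularity on the modified network. A minimal sketch of evaluating this rescaled modularity for a given partition follows; the optimization heuristic itself (Blondel et al. 2008) is not shown, and the toy network is an illustrative assumption.

```python
import numpy as np

def modularity(A, labels):
    """Weighted modularity Q of a partition: A is the symmetric
    weighted adjacency matrix, labels[i] the cluster of node i."""
    two_W = A.sum()
    w = A.sum(axis=1)                        # node strengths
    same = np.equal.outer(labels, labels)    # delta(C_i, C_j)
    return (A - np.outer(w, w) / two_W)[same].sum() / two_W

def modularity_at_resolution(A, labels, r):
    """Arenas et al. (2008): modularity of the network with a
    self-loop of weight r added to every node."""
    return modularity(A + r * np.eye(len(A)), labels)

# Toy network: two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0,1), (0,2), (1,2), (3,4), (3,5), (4,5), (2,3)]:
    A[i, j] = A[j, i] = 1.0
labels = np.array([0, 0, 0, 1, 1, 1])
q = modularity(A, labels)                    # = 5/14 for this partition
```

Sweeping r and maximizing modularity_at_resolution for each value yields partitions at different resolution levels, which is how the range from K = 5000 down to K = 500 was obtained.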
We define cluster quality in terms of a divergence measure based on Shannon entropy. Let p be a distribution and let H(p) denote the Shannon entropy of p. The Generalized Jensen–Shannon divergence (GJSD) can then be defined as (Lin 1991) ... Lower GJSD values for a cluster C are preferred to higher ones, since a lower value indicates a more coherent cluster.
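Lin's (1991) generalized Jensen–Shannon divergence of a set of distributions is the entropy of their weighted mixture minus the weighted mean of their entropies. A minimal sketch, assuming uniform weights over the documents in a cluster:

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(p) in bits; zero probabilities contribute 0."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def gjsd(distributions, weights=None):
    """Generalized Jensen-Shannon divergence (Lin 1991):
    H(sum_i pi_i p_i) - sum_i pi_i H(p_i)."""
    P = np.asarray(distributions, dtype=float)
    if weights is None:
        weights = np.full(len(P), 1.0 / len(P))  # uniform weights
    mixture = weights @ P
    return shannon_entropy(mixture) - sum(
        w * shannon_entropy(p) for w, p in zip(weights, P))

coherent = gjsd([[0.5, 0.5], [0.5, 0.5]])  # identical term distributions -> 0.0
disjoint = gjsd([[1.0, 0.0], [0.0, 1.0]])  # disjoint term distributions -> 1.0 bit
```

A cluster whose documents have identical term distributions yields a GJSD of zero, so lower values correspond to more textually coherent clusters, matching the quality criterion above.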
At each of the 19 resolution levels, SO outperforms FO, and thereby consistently does so.
In this study we have worked with a range of partitions, from fine-grained to coarse, and investigated if one of the similarity approaches consistently performs better than the other.
The results show that SO consistently outperforms FO. Moreover, each difference between SO and FO in overall partition quality values is significant at the 0.01 level.
Taking the large dataset of the present study into consideration, as well as the fact that we made the comparisons at several different resolution levels, we believe that the results of the study are of importance.
Since the second-order approach, but not the first-order one, is able to detect that two documents are similar by detecting that there are other documents such that the two documents are both (directly) similar to each of these other documents, the obtained similarity order results are intuitively expected.
Moreover, a potential benefit of calculating SO from a symmetric matrix of FO values, and thereby comparing similarity profiles instead of using the more traditional local (direct) approach based on the asymmetric occurrence matrix, is that second-order similarity calculations are based on a larger amount of data and may therefore involve less statistical uncertainty.
Compared to the network based on FO, the network based on SO tends to have higher transitivity (i.e., a higher value on the global clustering coefficient), indicating that nodes to a higher degree tend to cluster together (Wasserman and Faust 1994).
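The global clustering coefficient (transitivity) referred to here is the ratio of closed triplets to all connected triplets. For a binary undirected adjacency matrix it can be computed directly; the toy graphs below are illustrative assumptions.

```python
import numpy as np

def transitivity(A):
    """Global clustering coefficient of a binary undirected graph:
    (3 x number of triangles) / (number of connected triples)."""
    closed = np.trace(A @ A @ A)          # 6 x number of triangles
    deg = A.sum(axis=1)
    triples = (deg * (deg - 1)).sum()     # 2 x number of connected triples
    return closed / triples if triples else 0.0

triangle = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
path = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
t1 = transitivity(triangle)               # -> 1.0 (every triple is closed)
t2 = transitivity(path)                   # -> 0.0 (no triangles)
```

A higher value means that neighbors of a node are more likely to be linked to each other, which is why a more transitive SO network lends itself better to cluster detection.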
Further, when first-order similarities are used, documents might be highly similar with respect to subject matter without sharing any references, thus yielding a first-order cosine similarity of zero. This undesirable outcome can partly be remedied by calculating SO instead.