April 30, 2013 (Tuesday)

Schubert, A., & Soós, S. (2010). Mapping of science journals based on h-similarity. Scientometrics, 83(2), 589-600.

information visualization

This study uses h-similarity (Schubert 2010) as the measure of similarity between journals and clusters the journals with the Walktrap Community Finding (WCF) algorithm (Pons and Latapy 2005). Because the resulting clusters can represent the structure of science fields, the clustering results are compared with the subject categories of the ESI (Essential Science Indicators) database.
The h-similarity is computed from a journal's set of citing journals. Following the idea of the h-index, the top h citing journals, each of which cites the journal at least h times, form the journal's h-core. The similarity between two journals is then estimated as the Jaccard index of their h-cores, a value between 0 and 1: when the two h-cores share no journals, the h-similarity is 0, and the more journals the two h-cores have in common, the higher the h-similarity. The WCF algorithm used here (Pons and Latapy 2005) is a community detection algorithm based on modularity optimization (Girvan and Newman 2002; Newman and Girvan 2004; Newman 2006), so that the links within the resulting clusters are denser and stronger than those between clusters.
Applying this similarity measure and algorithm to the nearly 6000 journals of the 2006 SCI JCR database yields 61 clusters (called h-clusters), whose sizes vary considerably: the two largest h-clusters contain 1214 and 904 journals and correspond to biotechnology and clinical medicine, respectively. The study examines the ESI subject categories of the journals in each h-cluster, measures the betweenness centrality of the journals within each cluster, and takes the journal with the highest betweenness centrality as the representative of that h-cluster. The journals within an h-cluster are found to belong mostly to the same subject category. Looking in the other direction, at the h-clusters contained in each subject category, it turns out that when several h-clusters fall under the same subject category, that category is usually a broad one, such as engineering. Finally, the study runs a correspondence analysis with the two classification systems as journal features in order to visualize the subject categories and the h-clusters together; related classes are mapped to neighboring regions.

Journals covered by the 2006 Science Citation Index Journal Citation Reports database have been subjected to a clustering procedure utilizing h-similarity as the underlying similarity measure.

In this paper an attempt is made to use this h-similarity measure for clustering science journals, and thereby to gain a structural map of science fields.

The h-similarity between journals A and B has been defined through their ranked cited journal list. The h-core of a ranked cited journal list is the largest top h-element list with each item having at least h citations. Then we can define the h-similarity of two journals as
$h_{A,B} = \dfrac{|H_A \cap H_B|}{|H_A \cup H_B|}$

where $h_{A,B}$ is the h-similarity between journals A and B, the sets $H_A$ and $H_B$ are the h-cores of A and B, respectively, $\cap$ denotes the intersection, $\cup$ denotes the union and $|\cdot|$ denotes the cardinality of sets.

The h-similarity is a Jaccard-type similarity measure with a range [0, 1] that has value 0 if and only if the two h-cores have no common elements and has value 1 if and only if the two h-cores contain identical elements in whatever order.
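A minimal Python sketch of this definition, with invented journal names and citation counts (not data from the paper):

```python
def h_core(citing_counts):
    """Return the h-core of a journal: the largest set of its top h citing
    journals such that each of them cites the journal at least h times."""
    ranked = sorted(citing_counts.items(), key=lambda kv: kv[1], reverse=True)
    h = 0
    while h < len(ranked) and ranked[h][1] >= h + 1:
        h += 1
    return {journal for journal, _ in ranked[:h]}

def h_similarity(citing_a, citing_b):
    """Jaccard index of the two h-cores, a value in [0, 1]."""
    core_a, core_b = h_core(citing_a), h_core(citing_b)
    union = core_a | core_b
    return len(core_a & core_b) / len(union) if union else 0.0

# Toy example: dictionaries map citing journals to citation counts
journal_a = {"J1": 10, "J2": 8, "J3": 5, "J4": 1}
journal_b = {"J2": 7, "J3": 6, "J5": 4, "J6": 1}
print(h_similarity(journal_a, journal_b))  # 0.5: h-cores {J1,J2,J3} and {J2,J3,J5}
```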

The set of journals covered by the 2006 Science Citation Index Journal Citation Reports (SCI JCR 2006) database, with approximately 6000 elements, was subjected to a clustering procedure (or, more precisely, a community detection exercise) utilizing h-similarity as the underlying similarity measure.

Beyond classification of journals into fields and subfields, the auxiliary aim of the experiment was to explore the relationship of this novel classification to existing ones, of which the most straightforward and first example is the system of subject categories used by the Essential Science Indicators (ESI) database of Thomson–Reuters (formerly, ISI).

The ESI system (unlike several other classification schemes used in Thomson–Reuters databases) does not allow multiple classification, so each journal is assigned to exactly one of the 22 categories.

Instead of a direct comparison of h-clusters and ESI-fields, we attempted to characterize the emerging clusters in terms of ESI categories and vice versa, providing also an interpretation of them in terms of each other.

As the first step, we computed the pairwise h-similarities of the 6000 titles. The resulting proximity matrix, or, rather, the list of weighted journal pairs was conceived as the edgelist of a similarity graph, i.e. a weighted graph expressing the similarity pattern of the domain.

The basic idea w.r.t. partitioning the domain to journal sets representing fields and subfields was then to find subgraphs (communities) in this large network of journals that are (1) dense (in the sense that most members are similar to most other ones) and (2) strongly connected (meaning a high sum of h-similarity values).

To this end, a community detection method was chosen to be applied on the large network of titles that took into account edge weights (h-similarity values) of the graph. We used the Walktrap Community Finding (WCF) algorithm (Pons and Latapy 2005) as implemented in the igraph R package (R Development Core Team 2009) by Pons and Csardi (no date) and Csardi and Nepusz (2006), that attempts to find dense subgraphs by random walks.
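A hedged sketch of this step in the Python interface of igraph (the paper used the R implementation; the journal names and weights below are invented):

```python
import igraph as ig

# Weighted edge list of the similarity graph: (journal_a, journal_b, h-similarity)
edges = [
    ("Phys A", "Phys B", 0.6),
    ("Phys B", "Chem A", 0.3),
    ("Chem A", "Chem B", 0.7),
    ("Chem B", "Phys A", 0.1),
]
g = ig.Graph.TupleList(edges, weights=True)

# Walktrap community detection on the weighted graph; as_clustering() with no
# argument cuts the dendrogram at the level of maximum modularity, mirroring
# the optimization criterion described below.
dendrogram = g.community_walktrap(weights="weight")
clusters = dendrogram.as_clustering()
print(clusters.membership)
print(g.modularity(clusters.membership, weights="weight"))
```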

The WCF algorithm works in an agglomerative fashion, starting with the strongest communities and merging the closest ones in consecutive steps until the whole network is reconstructed. This procedure allowed us to select particular levels of agglomeration, yielding a hierarchical classification, according to some optimization criteria. For optimization we used the modularity function of Newman and Girvan (Girvan and Newman 2002; Newman and Girvan 2004; Newman 2006):
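In its standard Newman–Girvan form (an assumption here; the paper's own notation may differ), the modularity of a weighted partition is

$Q = \sum_{i} \left( e_{ii} - a_i^{2} \right)$

where $e_{ii}$ is the fraction of total edge weight falling within community $i$ and $a_i$ is the fraction of all edge weight incident to community $i$.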

Using this measure as the objective function to be maximized, we selected the community structure of highest modularity as the level of aggregation to be evaluated and compared to the ESI system.

The value of the modularity function reached its single maximum—after a slow and gradual increase and before a sudden fall— at a cluster structure with 61 clusters.

Due to the agglomerative nature of the process, it was the most inclusive level of aggregation, representing the general fields within this context.

The size distribution (Fig. 1) resulting from the solution providing 61 fields is rather asymmetric and skewed, indicating two "giant" fields (with 1214 and 904 titles), a few extensive fields (between 200 and 400 journals) and many middle-sized or small categories (with fewer than 200 journals).

In the first cluster (n = 1214) clinical medicine, biology and biochemistry, molecular biology/genetics and pharmacology/toxicology jointly account for more than 60% of the content, on the basis of which it can be considered a fairly coherent field of biotechnology.

The second cluster has an even more explicit orientation, since it is clearly dominated by clinical medicine (with more than 80% share among ESI areas).

In general, we can state that, in all cases, h-clusters emerged as rather coherent fields in terms of ESI categories.

Though cluster profiles showed somewhat varying distributions, from one heavily contributing ESI field (like in the case above) to a couple of characteristic categories, the top contributors, also covering the majority of the journals included, consistently identified the meaning of each cluster.

The exercise showed that, from the perspective of h-similarity, ESI fields are quite broad or diverse categories in many cases.

To quantify the h-similarity structure of ESI categories, we used the well-known Shannon index, the latter being a well-comprehensible measure of the position of a category on the uniformity–heterogeneity scale: the more heterogeneous a category is, the higher is the value of the Shannon index (also called entropy or diversity depending on the context of application).
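In its standard form, computed here presumably over the shares $p_1, \dots, p_n$ of a category's journals across the h-clusters, the Shannon index is

$H = -\sum_{i=1}^{n} p_i \ln p_i$

which is zero when the category is concentrated in a single h-cluster and reaches its maximum, $\ln n$, when the journals are spread evenly over $n$ clusters.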

Engineering is the top rated, that is, the most diverse ESI field; computer science, the social sciences in general, and the field of economics and business complete the top four. At the other extreme, microbiology and immunology appear monolithic; in general, the life sciences and chemistry tend to bear a value below the average diversity.

In order to capture all of the relations discussed so far in an expressive manner, we conducted a correspondence analysis (CA) upon the cross-tabulation of h-clusters and ESI fields as journal features or variables. In this way, by modelling h-clusters as ESI-profile points and vice versa, CA makes it possible to explore (1) the position of h-clusters relative to each other as determined by the ESI contributions, (2) the relative position of ESI fields based upon h-fields and also—with some limitations—(3) the relation between the two partitions.

It is most striking from the results, as shown in Fig. 3, that field-points in both taxonomies are clearly separated and organized into well-readable regions: the bottom right quadrant constitutes the biomedical life sciences, the upper right quadrant is for the environmental sciences. There is a relatively standalone point for chemistry, which is in accord with our previous observations concerning the field. On the left side, along the horizontal reference line lie the physical and engineering sciences (both applied and general), and the bottom left quadrant provides space for the traditionally mathematically oriented social sciences (economics and business). In this corner, as a distant point, we can also find mathematics. (Point sizes are proportional to the mass of the corresponding point/field.)

In order to refine the interpretation of h-fields, the core journals within each subgraph of titles have been identified. The core journal of an h-field in the present context was understood as a representative member of the class, to which the majority of cluster members are similar, that is, which bears the most extensive similarity-based kinship in the cluster. We called these titles the "prototypes" of the h-fields.




The procedure of tracing prototypes, then, consisted of the following steps.



(1) In the first step, to clarify and sharpen similarity patterns, we omitted from each cluster the edges with a weight (similarity) value below a selected threshold. This cutoff parameter was set to 0.2.

(2) In these clarified clusters betweenness centrality scores were calculated for each element, and the title with maximum score (in the unambiguous case) was qualified as the core journal.

(3) For each h-cluster, sub-clusters were identified utilizing the hierarchical nature of our classification procedure. In particular, we picked a second (lower) level of the hierarchy based on the distribution of modularity values that yielded approximately 200 clusters. Steps (1)–(2) were repeated at this level, resulting in a core journal for each of the 200 clusters. Sub-clusters covered by h-clusters, then, provided a set of further representative titles for each h-cluster at a more detailed level of the classification.
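A minimal sketch of steps (1)–(2) with python-igraph and an invented edge list; whether the betweenness computation used the similarity weights is an assumption left open here, so the unweighted variant is used:

```python
import igraph as ig

def cluster_prototype(edges, threshold=0.2):
    """Steps (1)-(2): drop edges whose similarity is below the threshold,
    then return the title with the highest betweenness centrality."""
    kept = [(a, b, w) for a, b, w in edges if w >= threshold]
    g = ig.Graph.TupleList(kept, weights=True)
    scores = g.betweenness()
    best = max(range(g.vcount()), key=lambda v: scores[v])
    return g.vs[best]["name"]

# Invented h-cluster edge list: (journal_a, journal_b, h-similarity)
edges = [("A", "B", 0.5), ("A", "C", 0.4), ("B", "C", 0.25),
         ("C", "D", 0.3), ("B", "E", 0.1)]  # the last edge is pruned
print(cluster_prototype(edges))
```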

April 29, 2013 (Monday)

Van Eck, N. J., Waltman, L., Noyons, E. C., & Buter, R. K. (2010). Automatic term identification for bibliometric mapping. Scientometrics, 82(3), 581-596.

information visualization

A term map visualizes the structure of a scientific field. Here, a term is a word or phrase that refers to a domain-specific concept, and a term map is a graphical representation of the relations between the important terms of the field. To construct term maps, this study proposes a method of automatic term identification that reduces expert labor and avoids the problems introduced by subjective judgment. Considering that terms identified from a corpus must possess both unithood and termhood (Kageura and Umino, 1996), the proposed method consists of three steps. In the first step, noun phrases are extracted from the input corpus as candidate terms, based on the output of a part-of-speech tagger (Schmid, 1994; Schmid, 1995). In the second step, the frequency of each candidate phrase is compared, via a likelihood ratio (Dunning, 1993), with the frequencies of the first word of the phrase and of the remainder of the phrase, in order to assess the degree to which the phrase forms a semantic unit; candidates with higher unithood are retained. The third step, measuring the termhood of the semantic units, is the main contribution of the study. After the terms with sufficient unithood and termhood have been identified, the relation between each pair of terms is represented by the association strength measure (Van Eck and Waltman 2009), and the term map is produced with the VOS technique (Van Eck and Waltman 2007a).
The study proposes estimating the termhood of each term from how its occurrences are distributed over topics. A term with high termhood is one that is strongly associated with only one or a few topics. For each term, the method therefore compares the distribution of that term over the topics with the overall distribution of the topics; a large difference indicates high termhood, meaning that terms with high termhood are discriminatory with respect to a small number of topics. However, because each document may cover several topics, these distributions cannot be obtained by simple counting. The study therefore uses probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) to estimate the distribution of each term over the topics, compares it with the overall topic distribution, and identifies the terms whose occurrences are biased towards a small number of topics.
Evaluating the results of automatic term identification is difficult (Pazienza et al., 2005). To evaluate the term identification results, the study builds a term map of operations research from 15 journals whose ISI subject category is operations research and assesses it in two ways. The first is to compare the recall and precision of the proposed method with two alternatives: term identification without PLSA and term selection based on frequency of occurrence. The second is a quality review of the resulting term map by experts in operations research. The first evaluation shows that, except at the lowest and highest levels of recall, the proposed method achieves higher precision than the two alternatives. The expert review finds that the term map reflects the division of operations research into methodology-oriented and application-oriented research topics, which agrees well with the experts' views. However, the current results also reveal problems: the method picks up some rather general terms, some topics are missing from the map, and some closely related terms are not located near each other on the map.

A term map is a map that visualizes the structure of a scientific field by showing the relations between important terms in the field.

To evaluate the proposed methodology, we use it to construct a term map of the field of operations research. The quality of the map is assessed by a number of operations research experts.

Other maps show relations between words or keywords based on co-occurrence data (e.g., Rip and Courtial 1984; Peters and Van Raan 1993; Kopcsa and Schiebel 1998; Noyons 1999; Ding et al. 2001). The latter maps are usually referred to as co-word maps.

By a term we mean a word or a phrase that refers to a domain-specific concept. Term maps are similar to co-word maps except that they may contain any type of term instead of only single-word terms or only keywords.

Selection of terms based on their frequency of occurrence in a corpus of documents typically yields many words and phrases with little or no domain-specific meaning. Inclusion of such words and phrases in a term map is highly undesirable for two reasons. First, these words and phrases divert attention from what is really important in the map. Second and even more problematic, these words and phrases may distort the entire structure shown in the map.

However, manual term selection has serious disadvantages as well. The most important disadvantage is that it involves a lot of subjectivity, which may introduce significant biases in a term map. Another disadvantage is that it can be very labor-intensive.

Given a corpus of documents, we first identify the main topics in the corpus. This is done using a technique called probabilistic latent semantic analysis (Hofmann 2001). Given the main topics, we then identify in the corpus the words and phrases that are strongly associated with only one or only a few topics. These words and phrases are selected as the terms to be included in a term map.

An important property of the proposed methodology is that it identifies terms that are not only domain-specific but that also have a high discriminatory power within the domain of interest. This is important because terms with a high discriminatory power are essential for visualizing the structure of a scientific field.

We define unithood as the degree to which a phrase constitutes a semantic unit. Our idea of a semantic unit is similar to that of a collocation (Manning and Schütze 1999). Hence, a semantic unit is a phrase consisting of words that are conventionally used together. The meaning of the phrase typically cannot be fully predicted from the meaning of the individual words within the phrase.

We define termhood as the degree to which a semantic unit represents a domain-specific concept.

Linguistic approaches are mainly used to identify phrases that, based on their syntactic form, can serve as candidate terms.

Statistical approaches are used to measure the unithood and termhood of phrases.

Most terms have the syntactic form of a noun phrase (Justeson and Katz 1995; Kageura and Umino 1996). Linguistic approaches to automatic term identification typically rely on this property. These approaches identify candidate terms using a linguistic filter that checks whether a sequence of words conforms to some syntactic pattern. Different researchers use different syntactic patterns for their linguistic filters (e.g., Bourigault 1992; Dagan and Church 1994; Daille et al. 1994; Justeson and Katz 1995; Frantzi et al. 2000).

Statistical approaches to measure unithood are discussed extensively by Manning and Schütze (1999). The simplest approach uses frequency of occurrence as a measure of unithood (e.g., Dagan and Church 1994; Daille et al. 1994; Justeson and Katz 1995). More advanced approaches use measures based on, for example, (pointwise) mutual information (e.g., Church and Hanks 1990; Damerau 1993; Daille et al. 1994) or a likelihood ratio (e.g., Dunning 1993; Daille et al. 1994). Another statistical approach to measure unithood is the C-value (Frantzi et al. 2000). The NC-value (Frantzi et al. 2000) and the SNC-value (Maynard and Ananiadou 2000) are extensions of the C-value that measure not only unithood but also termhood. Other statistical approaches to measure termhood can be found in the work of, for example, Drouin (2003) and Matsuo and Ishizuka (2004). In the field of machine learning, an interesting statistical approach to measure both unithood and termhood is proposed by Wang et al. (2007).

Termhood is measured as the degree to which the occurrences of a semantic unit are biased towards one or more topics.

In the first step of our methodology, we use a linguistic filter to identify noun phrases. We first assign to each word occurrence in the corpus a part-of-speech tag, such as noun, verb, or adjective. The appropriate part-of-speech tag for a word occurrence is determined using a part-of-speech tagger developed by Schmid (1994, 1995). We use this tagger because it has a good performance and because it is freely available for research purposes.

The most common approach to measure unithood is to determine whether a phrase occurs more frequently than would be expected based on the frequency of occurrence of the individual words within the phrase.

To measure the unithood of a noun phrase, we first count the number of occurrences of the phrase, the number of occurrences of the phrase without the first word, and the number of occurrences of the first word of the phrase. In a similar way as Dunning (1993), we then use a so-called likelihood ratio to compare the first number with the last two numbers.
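A sketch under the assumption that the ratio follows the standard Dunning (1993) collocation statistic as described by Manning and Schütze (1999), with the first word of the phrase as the conditioning word and invented counts:

```python
import math

def log_l(k, n, p):
    """Binomial log-likelihood, guarded against log(0)."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)

def dunning_llr(c_phrase, c_first, c_rest, n_tokens):
    """Log-likelihood ratio comparing how often the remainder of the phrase
    follows its first word with how often it occurs elsewhere in the corpus."""
    p = c_rest / n_tokens                                # independence hypothesis
    p1 = c_phrase / c_first                              # P(rest | first word)
    p2 = (c_rest - c_phrase) / (n_tokens - c_first)      # P(rest | other words)
    return 2 * (log_l(c_phrase, c_first, p1)
                + log_l(c_rest - c_phrase, n_tokens - c_first, p2)
                - log_l(c_phrase, c_first, p)
                - log_l(c_rest - c_phrase, n_tokens - c_first, p))

# Invented counts: phrase 30x, its first word 120x, the remainder 45x, 10,000 tokens
print(dunning_llr(30, 120, 45, 10_000))
```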

The main idea of the third step of our methodology is to measure the termhood of a semantic unit as the degree to which the occurrences of the unit are biased towards one or more topics.

To measure the degree to which the occurrences of semantic unit uk, where k ∈ {1,…,K}, are biased towards one or more topics, we use two probability distributions, namely the distribution of semantic unit uk over the set of all topics and the distribution of all semantic units together over the set of all topics. These distributions are denoted by, respectively, P(tj | uk) and P(tj), where j ∈ {1,…,J}. ... The dissimilarity between the two distributions indicates the degree to which the occurrences of uk are biased towards one or more topics. We use the dissimilarity between the two distributions to measure the termhood of uk.

For example, if the two distributions are identical, the occurrences of uk are unbiased and uk most probably does not represent a domain-specific concept. If, on the other hand, the two distributions are very dissimilar, the occurrences of uk are strongly biased and uk is very likely to represent a domain-specific concept.

The dissimilarity between two probability distributions can be measured in many different ways. One may use, for example, the Kullback–Leibler divergence, the Jensen–Shannon divergence, or a chi-square value.

In (3), termhood (uk) is calculated as the negative entropy of this distribution. Notice that termhood (uk) is maximal if P(tj | uk) = 1 for some j and that it is minimal if P(tj | uk) = P(tj) for all j. In other words, termhood (uk) is maximal if the occurrences of uk are completely biased towards a single topic, and termhood (uk) is minimal if the occurrences of uk do not have a bias towards any topic.
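As an illustration of the general idea (using the Kullback–Leibler divergence mentioned above rather than the paper's equation (3), and invented probabilities):

```python
import math

def kl_termhood(p_topic_given_unit, p_topic):
    """D(P(t|u) || P(t)): larger values mean the unit's occurrences are more
    strongly biased towards a few topics."""
    return sum(pu * math.log(pu / pt)
               for pu, pt in zip(p_topic_given_unit, p_topic) if pu > 0)

p_t = [0.5, 0.3, 0.2]                        # overall topic distribution
print(kl_termhood([0.9, 0.05, 0.05], p_t))   # biased towards one topic -> high
print(kl_termhood(p_t, p_t))                 # unbiased -> 0.0
```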

In order to allow for a many-to-many relationship between corpus segments and topics, we make use of probabilistic latent semantic analysis (PLSA) (Hofmann 2001).

It was originally introduced as a probabilistic model that relates occurrences of words in documents to so-called latent classes. In the present context, we are dealing with semantic units and corpus segments instead of words and documents, and we interpret the latent classes as topics.

PLSA assumes that each occurrence of a semantic unit in a corpus segment is independently generated according to the following probabilistic process. First, a topic t is drawn from a probability distribution P(tj), where j ∈ {1,…,J}. Next, given t, a corpus segment s and a semantic unit u are independently drawn from, respectively, the conditional probability distributions P(si | t), where i ∈ {1,…,I}, and P(uk | t), where k ∈ {1,…,K}. This then results in the occurrence of u in s.

$P(s_i, u_k) = \sum_{j=1}^{J} P(t_j)\, P(s_i \mid t_j)\, P(u_k \mid t_j)$

We estimate these parameters using data from the corpus. Estimation is based on the criterion of maximum likelihood. The log-likelihood function to be maximized is given by
$L = \sum_{i=1}^{I} \sum_{k=1}^{K} n_{ik} \log P(s_i, u_k)$
We use the EM algorithm discussed by Hofmann (1999, Sect. 3.2) to perform the maximization of this function.

After estimating the parameters of PLSA, we apply Bayes’ theorem to obtain a probability distribution over the topics conditional on a semantic unit. This distribution is given by

$P(t_j \mid u_k) = \dfrac{P(t_j)\, P(u_k \mid t_j)}{\sum_{j'=1}^{J} P(t_{j'})\, P(u_k \mid t_{j'})}$

In a similar way as discussed earlier, we use the dissimilarity between the distributions P(tj | uk) and P(tj) to measure the termhood of uk.
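A compact numerical sketch of the model and the formulas above, assuming NumPy and a toy segment-by-unit count matrix; it follows the standard PLSA EM updates rather than reproducing Hofmann's exact (tempered) formulation:

```python
import numpy as np

def plsa(n, J=10, iters=100, seed=0):
    """Minimal EM for PLSA on a segment-by-unit count matrix n of shape (I, K).
    Returns estimates of P(t), P(s|t) and P(u|t)."""
    rng = np.random.default_rng(seed)
    I, K = n.shape
    p_t = np.full(J, 1.0 / J)                                # P(t_j)
    p_s_t = rng.random((J, I))
    p_s_t /= p_s_t.sum(axis=1, keepdims=True)                # P(s_i | t_j)
    p_u_t = rng.random((J, K))
    p_u_t /= p_u_t.sum(axis=1, keepdims=True)                # P(u_k | t_j)
    for _ in range(iters):
        # E-step: responsibilities P(t_j | s_i, u_k), shape (J, I, K)
        joint = p_t[:, None, None] * p_s_t[:, :, None] * p_u_t[:, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: re-estimate the parameters from the expected counts
        expected = n[None, :, :] * post
        p_u_t = expected.sum(axis=1) + 1e-12
        p_u_t /= p_u_t.sum(axis=1, keepdims=True)
        p_s_t = expected.sum(axis=2) + 1e-12
        p_s_t /= p_s_t.sum(axis=1, keepdims=True)
        p_t = expected.sum(axis=(1, 2))
        p_t /= p_t.sum()
    return p_t, p_s_t, p_u_t

# Toy corpus: 4 segments x 5 semantic units, binary occurrences n_ik
n = np.array([[1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1]], dtype=float)
p_t, p_s_t, p_u_t = plsa(n, J=2, iters=50)

# Bayes' theorem (the last formula above): P(t_j | u_k) is proportional
# to P(t_j) P(u_k | t_j)
post_t_u = p_t[:, None] * p_u_t
post_t_u /= post_t_u.sum(axis=0, keepdims=True)
print(np.round(post_t_u, 2))
```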

We first selected a number of OR journals. This was done based on the subject categories of Thomson Reuters. The OR field is covered by the category Operations Research & Management Science. Since we wanted to focus on the core of the field, we selected only a subset of the journals in this category. More specifically, a journal was selected if it belongs to the category Operations Research & Management Science and possibly also to the closely related category Management and if it does not belong to any other category. This yielded 15 journals, which are listed in the first column of Table 1.

In the first step of our methodology, the linguistic filter identified 2662 different noun phrases. In the second step, the unithood of these noun phrases was measured. 203 noun phrases turned out to have a rather low unithood and therefore could not be regarded as semantic units. ... The other 2459 noun phrases had a sufficiently high unithood to be regarded as semantic units.

In the third and final step of our methodology, the termhood of these semantic units was measured. To do so, each title-abstract pair in the corpus was treated as a separate corpus segment. For each combination of a semantic unit uk and a corpus segment si, it was determined whether uk occurs in si (nik = 1) or not (nik = 0). Topics were identified using PLSA. This required the choice of the number of topics J. Results for various numbers of topics were examined and compared. Based on our own knowledge of the OR field, we decided to work with J = 10 topics.

The evaluation of a methodology for automatic term identification is a difficult issue. There is no generally accepted standard for how evaluation should be done. We refer to Pazienza et al. (2005) for a discussion of the various problems.

We first perform an evaluation based on the well-known notions of precision and recall. We then perform a second evaluation by constructing a term map and asking experts to assess the quality of this map.

Precision is the number of correctly identified terms divided by the total number of identified terms.

Recall is the number of correctly identified terms divided by the total number of correct terms.

Unfortunately, because the total number of correct terms in the OR field is unknown, we could not calculate the true recall. This is a well-known problem in the context of automatic term identification (Pazienza et al. 2005).

To circumvent this problem, we defined recall in a slightly different way, namely as the number of correctly identified terms divided by the total number of correct terms within the set of all semantic units identified in the second step of our methodology. Recall calculated according to this definition provides an upper bound on the true recall. However, even using this definition of recall, the calculation of precision and recall remained problematic. The problem was that it is very time-consuming to manually determine which of the 2459 semantic units identified in the second step of our methodology are correct terms and which are not. We solved this problem by estimating precision and recall based on a random sample of 250 semantic units.

It is clear from the figure that our methodology outperforms the two simple alternatives. Except for very low and very high levels of recall, our methodology always has a considerably higher precision than the variant of our methodology that does not make use of PLSA.

A term map is a map, usually in two dimensions, that shows the relations between important terms in a scientific field. Terms are located in a term map in such a way that the proximity of two terms reflects their relatedness as closely as possible. That is, the smaller the distance between two terms, the stronger their relation. The aim of a term map usually is to visualize the structure of a scientific field.

It turned out that, out of the 2459 semantic units identified in the second step of our methodology, 831 had the highest possible termhood value. This means that, according to our methodology, 831 semantic units are associated exclusively with a single topic within the OR field. We decided to select these 831 semantic units as the terms to be included in the term map. This yielded a coverage of 97.0%, which means that 97.0% of the title-abstract pairs in the corpus contain at least one of the 831 terms to be included in the term map.

The term map of the OR field was constructed using a procedure similar to the one used in our earlier work (Van Eck and Waltman 2007b). This procedure relies on the association strength measure (Van Eck and Waltman 2009) to determine the relatedness of two terms, and it uses the VOS technique (Van Eck and Waltman 2007a) to determine the locations of terms in the map.
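A small sketch of the association strength measure, assuming the usual definition from Van Eck and Waltman (2009) as the observed co-occurrence frequency divided by the product of the two terms' total occurrence frequencies (toy numbers):

```python
def association_strength(cooccurrences, occurrences_i, occurrences_j):
    """Observed co-occurrences divided by the product of the two occurrence
    totals, i.e. proportional to observed over expected co-occurrences."""
    if occurrences_i == 0 or occurrences_j == 0:
        return 0.0
    return cooccurrences / (occurrences_i * occurrences_j)

# Two terms occurring in 40 and 60 title-abstract pairs, co-occurring in 12
print(association_strength(12, 40, 60))
```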

The most serious criticism on the results of the automatic term identification concerned the presence of a number of rather general terms in the map.

Another point of criticism concerned the underrepresentation of certain topics in the term map. There were three experts who raised this issue. One expert felt that the topic of supply chain management is underrepresented in the map. Another expert stated that he had expected the topic of transportation to be more visible. The third expert believed that the topics of combinatorial optimization, revenue management, and transportation are underrepresented.

As discussed earlier, when we were putting together the corpus, we wanted to focus on the core of the OR field and we therefore only included documents from a relatively small number of journals. This may for example explain why the topic of transportation is not clearly visible in the map.

When asked to divide the OR field into a number of smaller subfields, most experts indicated that there are two natural ways to make such a division. On the one hand, a division can be made based on the methodology that is being used, such as decision theory, game theory, mathematical programming, or stochastic modeling. On the other hand, a division can be made based on the area of application, such as inventory control, production planning, supply chain management, or transportation. There were two experts who noted that the term map seems to mix up both divisions of the OR field. According to these experts, one part of the map is based on the methodology-oriented division of the field, while the other part is based on the application-oriented division.

The experts pointed out that sometimes closely related terms are not located very close to each other in the map. One of the experts gave the terms inventory and inventory cost as an example of this problem. In many cases, a problem such as this is probably caused by the limited size of the corpus that was used to construct the map. In other cases, the problem may be due to the inherent limitations of a two-dimensional representation.

Our main contribution consists of a methodology for automatic identification of terms in a corpus of documents. Using this methodology, the process of selecting the terms to be included in a term map can be automated for a large part, thereby making the process less labor-intensive and less dependent on expert judgment. Because less expert judgment is required, the process of term selection also involves less subjectivity.

In general, we are quite satisfied with the results that we have obtained. The precision/recall results clearly indicate that our methodology outperformed two simple alternatives. In addition, the quality of the term map of the OR field constructed using our methodology was assessed quite positively by five experts in the field. However, the term map also revealed a shortcoming of our methodology, namely the incorrect identification of a number of general noun phrases as terms.

As scientific fields tend to overlap more and more and disciplinary boundaries become more and more blurred, finding an expert who has a good overview of an entire domain becomes more and more difficult. This poses serious difficulties for any bibliometric method that relies on expert knowledge.

April 26, 2013 (Friday)

Takeda, Y., & Kajikawa, Y. (2009). Optics: A bibliometric approach to detect emerging research domains and intellectual bases. Scientometrics, 78(3), 543-558.

information visualization

This study clusters a citation network of journal papers by its topology in order to investigate the structure of research and to detect emerging research domains. The data are papers from the ISI database that were either published in journals whose subject is optics or are optics-related in content, together with their citation relations, 281,404 papers in total. The largest connected component, comprising 203,203 papers, was used as input to the clustering algorithm; judging from publication years and citation counts, the discarded papers are mostly older and less cited ones. The clustering algorithm [21, 22] is the modularity-optimization algorithm proposed by Newman and Girvan (2004). Clustering divides the papers into 825 clusters, of which the top five contain almost 90% of the input papers; their topics are optical communication, quantum optics, optical data processing, optical analysis, and lasers. Each cluster is then fed back into the clustering algorithm to produce sub-clusters at the next level. The topics of the larger sub-clusters at each level are analyzed and the average publication year of their member papers is computed; this average is used to judge whether a sub-cluster is an emerging research domain.

The study also finds that papers in the first cluster, optical communication, appeared relatively late and grew very rapidly only after 1990, yet quickly overtook the other topics. Comparing countries, the USA, Japan, Germany, France, and China all produce large numbers of papers, but US papers are more highly cited, whereas Chinese papers receive few citations. The clusters are visualized with a large graph layout [24] (Adai, Date, Wieland, and Marcotte, 2004) so that papers that cite each other are placed close together and form groups; from the resulting figures the clusters are judged to be either research fronts or intellectual bases of optics. The research fronts, mostly applied or basic research such as optical communication, quantum optics, and optical data processing, appear compact, whereas the intellectual bases, such as optical analysis and lasers, are mostly related to instruments used in research and appear stretched.

In this paper, we constructed a citation network of papers and performed topological clustering method to investigate the structure of research and to detect emerging research domains in optics.

There are various motivations to conduct bibliometric works: to evaluate research output [2–4], to grasp the overall structure of research [5–8], and to detect emerging research domains [7–9].

We collected citation data of optics-related publications from the Science Citation Index (SCI) compiled by the Institute for Scientific Information (ISI). We used the Web of Science, which is a Web-based user interface of ISI’s citation databases.

We collected citation data in two manners: one is a journal-based approach, and the other is a topic-based approach.

Therefore, we focused on the maximum connected component. The retrieved data were converted into a non-weighted, non-directed network. The obtained network currently has 203,203 papers (72.21% of the retrieved data).

Subsequently, the network was divided into clusters using the topological clustering method [21, 22]. Traditionally, co-citation has been used to analyze a citation network. However, because co-citation is accompanied by a time lag to create a link, and analysis of intercitation is more relevant in the similarity of pairs of documents than co-citation [23], we used intercitation as a link.

The clustering algorithm is based on modularity Q, which is defined as follows [21, 22]:
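$Q = \sum_{i} \left( e_{ii} - a_i^{2} \right)$ (standard Newman–Girvan form),

where $e_{ii}$ is the fraction of links that fall within cluster $i$ and $a_i$ is the fraction of link ends attached to nodes of cluster $i$.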

In other words, Q is the fraction of links that fall within clusters, minus the expected value of the same quantity if the links fall at random without regard for the clustered structure. Since a high value of Q represents a good division, we stopped joining when ΔQ became minus. A good partition of a network into clusters means there are many within-cluster links and minimal between-cluster links.

After clustering the network, we heuristically characterized each cluster by the titles and abstracts of papers that are frequently cited by the other papers in the same cluster.

The clustered network is visualized by using a large graph layout (LGL) [24], which is based on a spring layout algorithm where links play the role of spring connecting nodes. Thanks to such layout, papers that cite each other and form a group can be located in closer proximity.

The citation network of optics can be divided into 825 clusters by topological clustering method, where the number of nodes in each cluster varies from 2 (the smallest clusters) to 50,725 (the biggest cluster, #1). ... Cluster size, i.e., the number of nodes in each cluster, steeply decreases until the 4th cluster, and after the 5th cluster they become negligible. Therefore, in the following, we focus on the top 5 clusters. They cover almost 90% of papers in the network.

Therefore, to detect research fronts, in our words, emerging research domains, the usage of the average publication year in the cluster or time slices of the networks [9, 26] is effective. In our analysis, we can detect emerging subdomains of optics research by using the average publication year.
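A trivial sketch of this criterion with invented clusters and publication years:

```python
from statistics import mean

def average_publication_year(clusters, pub_year):
    """Average publication year per cluster; a young average flags a
    candidate emerging research domain."""
    return {cid: mean(pub_year[p] for p in papers)
            for cid, papers in clusters.items()}

clusters = {0: ["p1", "p2", "p3"], 1: ["p4", "p5"]}
pub_year = {"p1": 1992, "p2": 2001, "p3": 2005, "p4": 1975, "p5": 1981}
print(average_publication_year(clusters, pub_year))
```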

We found that optics consists of five main subclusters: optical communication, quantum optics, optical data processing, optical analysis, and lasers. In all of these clusters, the USA has the largest number of publications and receives the most citations, which indicates the leadership and presence of the USA. China has a large number of publications, but not citations. The average publication year of papers in each cluster indicated that the largest cluster, optical communication, had the fewest publications before 1990 but has recently increased steeply.

The visualization of the network for the top five clusters showed that three large clusters are compact while two relatively small clusters are stretched. This implies that the latter work as intellectual bases for the former.

April 21, 2013 (Sunday)

Leydesdorff, L., & Rafols, I. (2009). A global map of science based on the ISI subject categories. Journal of the American Society for Information Science and Technology, 60(2), 348-362.

information visualization

This study performs an exploratory factor analysis on the cross-citation data of the 175 ISI subject categories and uses the result to examine whether the ISI subject categories can serve as a basis for information visualization. Related work includes Boyack, Klavans, and Börner (2005), who used the VxOrd/VxInsight layout algorithm to visualize all journals covered by the ISI databases according to their cross-citation relations, producing a journal map reflecting the structure of science and generating a new classification with k-means clustering, and Moya-Anegón et al. (2007), who used co-citation data of the ISI subject categories as input to the Pathfinder network algorithm to produce a map. The data of the present study are the 2006 ISI journal citation data; in 2006, three of the 175 subject categories ("Psychology, biological," "Psychology, experimental," and "Transportation") have no citing data, although they still appear in the cited dimension. The data were first normalized with the cosine measure, factor analysis was carried out in SPSS (v15), and visualization used the Kamada and Kawai (1989) algorithm in the Pajek program (Batagelj & Mrvar, 2007). For the 172 categories with citing data, the factor analysis yields 14 factors, which correspond to disciplinary classifications. Factor analysis of the cited data likewise reveals a disciplinary structure. Comparing the citing and cited results, 154 of the 172 subject categories (89%) fall in the same factor in both.

The ISI subject categories classify journals included in the Science Citation Index (SCI). The aggregated journal-journal citation matrix contained in the Journal Citation Reports can be aggregated on the basis of these categories. This leads to an asymmetrical matrix (citing versus cited) that is much more densely populated than the underlying matrix at the journal level. Exploratory factor analysis of the matrix of subject categories suggests a 14-factor solution. This solution could be interpreted as the disciplinary structure of science. The nested maps of science (corresponding to 14 factors, 172 categories, and 6,164 journals) are online at http://www.leydesdorff.net/map06.

In contrast, Boyack, Klavans, and Börner (2005) used the VxInsight algorithm (Davidson, Hendrickson, Johnson, Meyers, & Wylie, 1998) in order to map the whole journal structure as a representation of the structure of science.

Moya-Anagón et al. (2004, 2007) used cocitation and PathFinder for mapping the whole of science on the basis of the ISI subject categories.

Klavans and Boyack (2007, p. 438) noted that a journal may occupy a different position in a different context: Many journals report on developments in multiple disciplines; journals can also function as a major source of references in more than one specialty.

Since citation relations among journals are dense in discipline-specific clusters and otherwise virtually nonexistent, the journal-journal citation matrix can be considered nearly decomposable.

The next-order units represented by the square submatrices— and representing in this case disciplines or specialties—are reproduced in relatively stable sets (of journals), which may change over time. The sets of journals are functional subsystems that show a high density in terms of relations within the center (i.e., core journals), but are more open to change in relations at the margins.

The decomposition into nearly decomposable matrices has no analytical solution. However, algorithms can provide heuristic decompositions when there is no single unique correct answer (Newman, 2006a, 2006b).

The number of category attributions in the Science Citation Index is 9,848 for 6,164 journals in 2006 or, in other words, approximately 1.6 categories per journal. The coverage of the 172 categories ranges from 262 journals sorted under “Biochemistry and Molecular Biology” to 5 journals sorted under a single category. The average number of journals per category is 56.3 (see Figure 1).

In other words, our research question is different from Boyack et al.’s (2005) effort to generate a new classification using a bottom-up strategy and from that of Moya-Anagón et al. (2007), who employed the ISI subject categories as units of measurement (at p. 2169), and used factor analysis of the cocitation matrix for the validation of their so-called “factor scientograms.”

We wish to question the quality and validity of using the ISI subject categories for mapping purposes. Can these subject classifications be used in further research to demarcate the sciences and perhaps as field delineations, and if so, under what conditions?

As noted above, Moya-Anegón et al. (2007, p. 2173) used factor analysis of the cocitation matrix of the 218 categories of the Science Citation Index and Social Science Citation Index (2002) combined for the validation of their visualizations. These authors stated that a scree test had led them to the choice of 16 factors.

We approach the problem first factor-analytically using the asymmetrical matrix of aggregated citations among categories, and will subsequently try to map the sciences hierarchically top-down insofar as our results show that it is legitimate for us to do so.

The data was harvested from the CD-ROM version of the Journal Citation Reports of the Science Citation Index 2006. As indicated above, 175 subject categories are used. Three categories (“Psychology, biological,” “Psychology, experimental,” and “Transportation”) are no longer used as classifiers in the citing dimension, but four journals are still indicated with these three categories in the cited dimension. Thus, we work with 172 citing and 175 cited categories.


The matrix, accordingly, contains two structures: a cited and a citing one. Salton’s cosine was used for normalization in both the cited and citing directions (Ahlgren, Jarneving, & Rousseau, 2003; Salton & McGill, 1983).
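A minimal NumPy sketch of Salton's cosine applied to the rows (citing direction) of a toy category-by-category citation matrix; transposing the matrix gives the cited direction:

```python
import numpy as np

def salton_cosine(c):
    """Cosine similarity between the row vectors of a citation matrix c."""
    norms = np.linalg.norm(c, axis=1, keepdims=True)
    unit = c / np.where(norms == 0.0, 1.0, norms)
    return unit @ unit.T

# Toy citing matrix: 3 subject categories x 4 cited categories
c = np.array([[120.0, 30.0, 0.0, 5.0],
              [100.0, 40.0, 2.0, 0.0],
              [3.0, 0.0, 80.0, 60.0]])
print(np.round(salton_cosine(c), 2))
```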


Pajek is used for the visualizations (Batagelj & Mrvar, 2007) and SPSS (v15) for the factor analysis. The threshold for the visualizations is pragmatically set at cosine≥0.2. Visualizations are based on the algorithm of Kamada and Kawai (1989).



Let us focus on the structure in the citing dimension because this structure is actively maintained by the indexing service and is therefore current.... The factor loadings for the 172 categories on the 14 factors in the citing dimension are provided in the Appendix. They can be interpreted in terms of disciplines, such as physics, chemistry, clinical medicine, neurosciences, engineering, and ecology.

The factors in the cited dimension can be designated using precisely the same disciplinary classifications, but their rank order (that is, the percentage of variance explained by each factor) is different (Table 2). Out of the 172 categories, 154 (89%) fall in the same factor in both the citing and cited projections. ... The strong overlap between the results of the factor analysis in the cited and the citing dimension (Table 2) suggests that the matrix is nearly decomposable in terms of central tendencies.

Our results are consistent with previously reported maps (Boyack et al. 2005; Boyack & Klavans, 2007; Moya-Anagón et al., 2007), but we chose to exclude the social sciences.

April 20, 2013 (Saturday)

Boyack, K. W., Klavans, R., & Börner, K. (2005). Mapping the backbone of science. Scientometrics, 64(3), 351-374.

information visualization

The general process of science mapping [7] consists of the following five steps: 1) selection of an appropriate data source; 2) selection of a unit of analysis and extraction of the required data from the chosen source; 3) choice of an appropriate similarity measure and calculation of similarity values; 4) creation of a map using ordination and clustering algorithms; and 5) exploration of the resulting map to answer the research questions. Most previous journal mapping studies used the Pearson correlation of journal co-citation data as the journal similarity measure and produced the map with multidimensional scaling (MDS) [11-16]. Leydesdorff [17, 18] also used MDS for mapping, but measured journal similarity with inter-citation data. Leydesdorff further applied the Pearson correlation of inter-citation data as the similarity measure and used the Pajek program [27] to draw separate maps of SCI and SSCI journals [22, 23]. In addition, Campanario [19] used a self-organizing map for ordination, and Tijssen and van Leeuwen [20] used journal content mapping so that their study could include non-ISI journals.

This study produces a map representing the structure of all of science from 7,121 journals in the ISI SCI and SSCI databases and analyzes the structural accuracy of the resulting map, as well as its local accuracy. Local accuracy means that journals belonging to the same subdiscipline are grouped together (Klavans and Boyack, 2006), while structural accuracy means that groups of journals that cite each other are also placed close to each other on the map, which is what the study regards as the backbone of science.

In this study, the VxOrd [32] algorithm is used for ordination and k-means for clustering, and eight measures of journal similarity are compared. Five are based on inter-citation data: raw frequency, the cosine index, the Jaccard index, the Pearson correlation coefficient, and the average relatedness factor proposed by Pudovkin and Garfield [25]. The other three are based on co-citation data: raw frequency, the Pearson correlation coefficient, and the K50 measure introduced by the authors [1].

The ISI subject categories of journals are used to evaluate and compare the maps produced by the different similarity measures. For local accuracy, the assumption is that a pair of journals with high similarity should belong to the same subject category and that journals in the same cluster of a map should share the same subject category. The results of [1] show that ordination with the VxOrd algorithm actually increases local accuracy, that the four normalized inter-citation measures perform comparably, and that the cosine index performs best. For structural accuracy, the present study compares the maps using the mutual information test proposed by Gibbons and Roth [37], again with the 205 ISI subject categories as the reference classification. The results show that, apart from the poor clustering obtained from raw co-citation data, the clustering results of the remaining similarity measures are roughly comparable; the Pearson correlation on inter-citation data gives the best result, although between 200 and 250 clusters the Jaccard index on inter-citation data performs comparably.

Finally, the five inter-citation-based measures and the three co-citation-based measures are compared in terms of local accuracy, structural accuracy, scalability, and the readability of the clustering results. Among the three co-citation-based measures, the map obtained with the K50 measure is roughly comparable in structural accuracy to that obtained with the Pearson correlation, but K50 is more scalable, has better local accuracy, and yields a map that is better balanced in cluster size and position. Among the five inter-citation-based measures, the cosine index, the Jaccard index, and the Pearson correlation are more scalable and readable than the other two, and the Jaccard map has slightly higher structural accuracy.

This paper presents a new map representing the structure of all of science, based on journal articles, including both the natural and social sciences. ... Eight alternative measures of journal similarity were applied to a data set of 7,121 journals covering over 1 million documents in the combined Science Citation and Social Science Citation Indexes. For each journal similarity measure we generated two-dimensional spatial layouts using the force-directed graph layout tool, VxOrd.

By accuracy, we mean that journals within the same subdiscipline should be grouped together, and groups of journals that cite each other should be proximate to each other on the map. The first results from this effort, dealing with local accuracy, appeared recently. By contrast, this paper focuses on structural accuracy and characterization of the map defining the structure or backbone of science.

Published journal-based maps have typically been focused on single disciplines, and have used a Pearson correlation on co-citation counts with multidimensional scaling (MDS). [11–16] Other discipline-level studies not using the Pearson/MDS technique include the use of relative inter-citation counts with MDS by Leydesdorff [17,18], the use of a self-organizing map by Campanario [19], and the work by Tijssen and van Leeuwen to include non-ISI journals in their maps using journal content mapping. [20]

Leydesdorff has used the 2001 JCR data to map 5,748 journals from the Science Citation Index (SCI) [22] and 1,682 journals from the Social Science Citation Index (SSCI) [23] in two separate studies. In both studies Leydesdorff uses a Pearson correlation on citing counts as the edge weights and the Pajek program for graph layout, progressively lowering thresholds to find articulation points (i.e., single points of connection) between different network components. These network components are his journal clusters. The only potential drawback to this solution is that as thresholds are lowered, newly identified small components (presumably two or three journals each) are dropped from the solution space, so that the total number of journals comprising Leydesdorff's clusters is substantially less than the number in the original set.

An alternative to using journals to map the structure of science has recently been investigated by Moya-Anegón and associates [9] to good effect. Using 26,062 documents with a Spanish address from the year 2000 as a base set, they used co-cited ISI category assignments to create category maps. Their highest level map shows the relative positions, sizes and relationships between 25 broad categories of science in Spain.

The general process followed by most practitioners for creating knowledge domain maps has been explained in detail elsewhere. [7] This process can vary slightly depending upon the specific research question, but typically contains the following steps: 1) selection of an appropriate data source, 2) selection of a unit of analysis (e.g. paper, journal, etc.) and extraction of the necessary data from the selected source, 3) choice of an appropriate similarity measure and calculation of similarity values, 4) creation of a data layout using a clustering or ordination algorithm, and 5) exploration of the map based on the data layout as a means of answering the original research questions. Here, we add another step after 4) - statistical validation - that allows us to choose the similarity measure that produces the most accurate map.

Based on these considerations, we obtained the complete set of 1.058 million records from 7,349 separate journals from the combined SCI and SSCI files for the year 2000. Of the 7,349 journals, analysis was limited to the 7,121 journals that appeared as both citing and cited journals. ... Journal inter-citation frequencies were directly counted from the citing and cited journal information in these 16.24 million reference pairs. The resulting journal-journal inter-citation frequency matrix was extremely sparse (98.6% of the matrix has zeros). ... While there was a great deal more cocitation frequency information, the journal-journal co-citation frequency matrix was also sparse (93.6% of the matrix has zeros).

For the purpose of map validation we also retrieved the ISI journal category assignments. For the combined SCI and SSCI, there were a total of 205 unique categories. Including multiple assignments, the 7,121 journals were assigned to a total of 11,308 categories, or an average of 1.59 categories per journal.

The five inter-citation measures include one unnormalized measure, raw frequency (IC-Raw); and four normalized measures, Cosine (IC-Cosine), Jaccard (IC-Jaccard), Pearson’s r (IC-Pearson), and the recently introduced average relatedness factor of Pudovkin and Garfield [25] (IC-RFavg).

The three co-citation measures include one unnormalized measure, raw frequency (CC-Raw); the vector-based Pearson’s r (CC-Pearson), and a new normalized frequency measure [1] that we call K50 (CC-K50). This new measure, K50, is simply a cosine-type value minus an expected cosine value. E_ij is the expected value of F_ij, and varies with the row sum, S_j; thus K50 is asymmetric and E_ij ≠ E_ji. Subtraction of an expected value component tends to accentuate ‘higher than expected’ relationships between two small journals or between a small and a large journal, and discounts ‘lower than expected’ relationships between large journals. We thus expect the K50 measure to do a better job than other measures of accurately placing small journals, and to reduce the influence of large and multidisciplinary journals on the overall map structure.

The most commonly used reduction algorithm is multidimensional scaling; however, its use has typically been limited to data sets on the order of tens or hundreds of items.

Factor analysis is another method for generating measures of relatedness. In a mapping context, it is most often used to show factor memberships on maps created using either MDS or pathfinder network scaling, rather than as the sole basis for a map. Yet, factor values can be used directly for plotting positions. For instance, Leydesdorff [23] directly plotted factor values (based on citation counts) to distinguish between pairs of his 18 factors describing the SSCI journal set.

Layout routines capable of handling these large data sets include Pajek, [27] which has recently been used on data sets with several thousand journals by Leydesdorff, [22,23] and which is advertised to scale to millions of nodes; self-organizing maps, [28] which can scale, with various processing tricks, to millions of nodes, [29] and the bioinformatics algorithm LGL, [30] capable of dealing with hundreds of thousands of nodes, which uses an iterative layout as well as data types and algorithms from the Boost Graph Library. [31]

We chose to use VxOrd, [32] a force-directed graph layout algorithm, over the other algorithms mentioned, for several reasons. VxOrd improves on a traditional force-directed approach by employing barrier jumping to avoid trapping of clusters in local minima, and a density grid to model repulsive forces. Because of the repulsive grid, computation times are of order O(N) rather than O(N²), allowing VxOrd to be used on graphs with millions of nodes. VxOrd also applies edge cutting criteria, which leads to graph layouts exhibiting both local (orientation within groups) and global (group-to-group) structure. The combination of the initial node and edge structure and cutting criteria thus determine the number, size, shape, and position of natural groupings of nodes.

Validation of science maps is a difficult task. In the past, the primary method for validating such maps has been to compare them with the qualitative judgments made by experts, and has been done only for single-discipline-scale maps (see the background section of Klavans & Boyack [1] for more discussion).

A more pragmatic approach is to use the ISI journal classifications to evaluate the validity of the journal similarity measures and the corresponding maps. The ISI journal classification system, while it does have its critics, is based on expert judgment and is widely used. In principle, users would expect that pairs of journals with high similarity should be in the same ISI category. Journals in the same cluster of a journal mapping should have the same ISI category assignments. These assumptions are used to validate and compare the eight different similarity measures and corresponding graph layouts or maps.

In our previous work with the current data set, and the same eight similarity measures and maps from Figure 1, we investigated local accuracy and the effects on accuracy of reducing dimensionality with VxOrd [1] using the ISI category assignments as a reference basis. We found that, counterintuitively, use of VxOrd algorithm to convert similarities to map positions actually increased local accuracy. We also found that four of the inter-citation measures had roughly comparable local accuracy at 95% journal coverage, and recommended the IC-Cosine measure as the best overall measure.

In this work we focus on structural accuracy or the validity of the global structure of the solution space. To make quantitative comparisons of our eight maps of science, we implement a mutual information method recently used to distinguish between gene clustering algorithms. [37] This mutual information method requires a reference basis, for which we use the ISI journal category assignments.

To employ the method of Gibbons and Roth [37] we need to do a clustering of each of the maps. VxOrd gives (x,y) coordinate positions for each node, but does not assign cluster numbers to the nodes. Thus, k-means clustering was applied to each of the maps in Figure 1. Other clustering methods (e.g. linkage or density-based clustering) could have been used.
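A hedged sketch of this comparison with scikit-learn, assuming one ISI category per journal and expressing the mutual information of a k-means solution as a Z-score against permuted category labels (the exact scoring scheme of Gibbons and Roth is not spelled out above, so the permutation baseline here is an assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def clustering_z_score(xy, categories, n_clusters=200, n_random=100, seed=0):
    """Cluster the 2-D layout with k-means, then compare the mutual information
    between cluster labels and ISI categories with that of permuted labels."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(xy)
    observed = mutual_info_score(categories, labels)
    random_mi = [mutual_info_score(rng.permutation(categories), labels)
                 for _ in range(n_random)]
    return (observed - np.mean(random_mi)) / np.std(random_mi)

# xy: (x, y) positions from the layout; categories: one ISI category per journal
# z = clustering_z_score(xy, categories, n_clusters=200)
```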

The CC-Raw map clearly performs the worst. The Z-scores for all other measures are near or above a value of 350, indicating that all of these measures give maps that are far from random. The IC-Pearson map gives the highest Z-score over nearly the entire range of cluster solutions. It is only at the higher end, from 200 through 250 clusters, that the IC-Jaccard map has a Z-score comparable to that of the IC-Pearson.

Hence, based on Z-scores it is likely that any of the six would be a suitable choice as the basis for an accurate map of science.

For a co-citation-based map, the CC-K50 measure is a clear winner for several reasons. Although the Z-score for the CC-K50 is nearly identical to that of the CC-Pearson, the K50 measure is scalable to much larger numbers of nodes, while the Pearson is a full N² calculation, and cannot easily scale much higher than the 7000 nodes used here. The CC-K50 map is a visually well-balanced map with a good distribution of cluster sizes and positions (see Figures 1 and 3). By contrast, the CC-Pearson map appears very stringy; clusters are very dense with less visual differentiation between disciplines, and thus not as suitable for presentation. The CC-K50 map also has a higher degree of local accuracy. [1]

Of these three, IC-Cosine, IC-Jaccard, and IC-Pearson, we choose to further characterize the IC-Jaccard as our best map due to its slightly higher Z-score, realizing that the Cosine map is in a virtual dead heat statistically, and the Pearson map only somewhat less in local accuracy.

Differences such as these between the maps at the discipline level are likely due to fine-scaled differences between the co-citation and inter-citation patterns. Yet, the overall consistency between the co-citation and inter-citation-based maps of science suggests the general structure described here is robust.

Figure 6 shows the clear distinction between two main areas within the LIS discipline. Although there are relationships between journals in the two clusters, the dominant relationships (darkest edges) are within clusters. The journals in the cluster at the upper left all focus on libraries and librarians and their work, while those in the cluster at the lower right are all focused on advances in information science.

Eight different similarity measures were calculated from the combined SCI/SSCI data and the resulting journal-journal similarity matrices were mapped using VxOrd. The eight maps were then compared based on two different accuracy measures, the scalability of the similarity algorithm, and the readability of layouts (clustering).

April 16, 2013 (Tuesday)

Zhang, L., Liu, X., Janssens, F., Liang, L., & Glänzel, W. (2010). Subject clustering analysis based on ISI category classification. Journal of Informetrics,4(2), 185-193.

information visualization

This study investigates the information flow among the ISI subject categories.
Using journal cross-citation data, the study first measures, for each subject category, the citation entropy, a self-link index, and the link strengths to other categories. Here entropy measures whether a category's citation links are spread over many categories: the higher the entropy, the more evenly the category's citing and cited links are distributed over other categories. Nine of the ten categories with the highest entropy belong to the arts and humanities. After removing all arts and humanities categories, most of the top ten are social science categories, together with computer science and interdisciplinary applications; this agrees with the results of Leydesdorff and Rafols (2009), who used betweenness centrality as an indicator of interdisciplinarity. The self-link index reflects the specialisation of a category; the three categories with the highest values are astronomy and astrophysics, mathematics, and law. The study also computes the link strength between each pair of categories, treats values above 0.05 as strong links, and counts the number of strong links of each category in order to see which categories cite or are cited by which others. Compared with the citation entropy, the social science categories spread their citations over many categories and therefore have high entropy, whereas the natural science categories concentrate their citations, so strong links are easier to find. In addition, the study builds ego-centred neighbour maps from a category's link strengths to present the relations between that category and its most related categories.
最後本研究嘗試將主題分類利用多層次聚合方法(Multi-level Aggregation Method, MAM) (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008)加以分群,比較以主題分類為對象和以期刊為對象(Zhang, Janssens, et al., 2009)的分群結果。MAM是一種以模組性(modularity)為基礎的社群偵測(community detection)演算法。這個演算法,第一回合將每一個節點指定為一個叢集,然後開始嘗試將任何兩個叢集合併,計算合併後的模組性,一直到模組性無法增加為止,便結束這個回合;下一回合開始時,將合併到同一叢集的節點視為一個節點,再將每一個節點指定為一個叢集,然後重複上述的合併程序,一直到所有的節點都合併成一個叢集或是模組性無法增加為止(Blondel, Guillaume, Lambiotte, & Lefebvre, 2008)。本研究利用每一回合的結果做為不同解析度(resolution)的分群結果,選擇模組性最高時的解析度。Zhang, Janssens, et al. (2009)是以期刊的內容進行分群,並且根據期刊名稱來標示分群的結果,本研究的分群則是從期刊間的引用資料聚集為主題分類間的引用,再對主題分類進行分群,最後根據主題分類決定分群結果的標示。從實驗的結果,本研究認為這兩種結果在結構上有所差異,而此差異不僅和分群的方式有關,同時有相當多期刊具有重複的主題分類也有很大的影響。因此,本研究的分群結果有助於改善期刊的主題分類。

The present study will focus on the analysis of the information flow among the ISI subject categories. This will be done for two important reasons. This exercise aims at finding an appropriate field structure of the Web of Science using the subject clustering algorithm developed in previous studies. Furthermore, since ISI subject categories are based on journal assignment the question arises of what changes if journal cross-citation is replaced by subject cross-citation. If changes are not essential, the elaborate clustering of more than 8000 journals could be substituted by a somewhat easier analysis of roughly 250 ISI categories and the journal level could, as it were, be skipped.

Boyack, Klavans, and Börner (2005) applied eight alternative measures of journal similarity to a dataset of 7121 journals covering over one million documents in the combined Science Citation and Social Sciences Citation Indexes, to show a global map of science using the force-directed graph layout tool VxOrd.

Chen (2008) proposes an approach to classify scientific networks in terms of aggregated journal-journal citation relations of the ISI Journal Citation Reports using the affinity propagation method.

As mentioned in the outset, Zhang, Glänzel, et al. (2009) and Zhang, Janssens, et al. (2009) have also investigated different methods for the analysis and classifications of scientific journals.

Glänzel and Schubert (2003) designed a new classification scheme of science fields and subfields for scientometric evaluation purposes.

Moya-Anegon et al. (2004) proposed a new technique that uses thematic classification as entities of co-citation, and presented an ego-centred network of 222 ISI categories including science and social sciences.

Leydesdorff and Rafols (2009) classified the ISI 172 science categories into 14 groups based on factor analysis, and compared the interdisciplinarity of each category using betweenness centrality.

Compared to other researchers, we applied a new clustering technique to classify the ISI science and social sciences categories into 7 groups based on the category–category cross-citation similarities, and further compared the results with the 7 hybrid clustering solution of 8305 journals in a previous study (Zhang, Janssens, et al., 2009).

The data have been collected from the Web of Science of Thomson-Reuters. Altogether 9487 journals which were assigned to the 246 categories of sciences, social sciences and arts and humanities in the entire period of 2002–2006 were selected and only three document types, namely, article, letter and review, were taken into consideration. More than six million papers were indexed and citations have been summed up through a variable citation window, from the publication year till 2006.

The clustering method adopted in this study is the Multi-level Aggregation Method (MAM) (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008), which is a new clustering algorithm based on the modularity optimization.

Modularity (Newman, 2006) is a benefit function used in the analysis of networks or graphs such as computer networks or social networks. It quantifies the quality of a division of a network into modules or communities. Good divisions, having high modularity values, are those in which there are dense internal connections between the nodes within modules but only sparse connections between different modules.

The modularity of this division is defined to be the fraction of the edges that fall within the given groups minus the expected such fraction if edges were distributed at random.

The value of the modularity lies in the range [−1,1]. It is positive if the number of edges within groups exceeds the number expected on the basis of chance.
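
To make the definition concrete, here is a minimal sketch (not taken from the paper) that computes modularity for a given partition of a small weighted, undirected network; the graph, partition, and function names are illustrative only.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity Q of a partition: fraction of edge weight inside communities
    minus the fraction expected if edges were placed at random.
    `edges` is a list of (u, v, weight); `community_of` maps node -> community id."""
    two_m = sum(2 * w for _, _, w in edges)        # total degree = twice the total edge weight
    degree = defaultdict(float)
    internal = defaultdict(float)                  # edge weight inside each community
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    community_degree = defaultdict(float)
    for node, d in degree.items():
        community_degree[community_of[node]] += d
    return sum(internal[c] / (two_m / 2) - (community_degree[c] / two_m) ** 2
               for c in community_degree)

# Two triangles joined by a single edge: the natural two-community split scores well.
edges = [("a", "b", 1), ("b", "c", 1), ("a", "c", 1),
         ("d", "e", 1), ("e", "f", 1), ("d", "f", 1),
         ("c", "d", 1)]
partition = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
print(round(modularity(edges, partition), 3))      # approximately 0.357
```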

In MAM, firstly each node of the network is assigned to a single community.
The two nodes i and j are merged on the basis of the maximum modularity gain defined in Eq.(2).
The merging process is repeated until local maximum modularity is reached.
Then the current communities are employed to form super nodes to repeat the above merging process of nodes.
This process is applied repeatedly and sequentially for all nodes until no further improvement can be achieved.

When a local modularity maximum value is reached during the optimization, it will correspond to a cluster number from the formed communities.
Since generally there are several local modularity maximum values available during the optimization stage, these various cluster numbers under such modularity values can be regarded as different clustering levels (resolutions).
Therefore, we can find the most appropriate number of clusters from these different clustering levels, because the global modularity maximum can be found among the local modularity maxima.
Thus Multi-level Aggregation Method provides a heuristic scheme to determine the number of clusters automatically.
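
The same multi-level aggregation scheme is implemented in common network libraries. As a hedged illustration only (not the authors' code or data), the sketch below runs NetworkX's Louvain routines on a toy weighted category-to-category network and reports the modularity reached at each aggregation level; the category names and weights are made up.

```python
import networkx as nx

# Toy symmetrised cross-citation network; labels and weights are illustrative only.
G = nx.Graph()
G.add_weighted_edges_from([
    ("MATH", "PHYSICS", 120), ("PHYSICS", "ASTRONOMY", 90), ("MATH", "ASTRONOMY", 30),
    ("SOCIOLOGY", "ECONOMICS", 80), ("ECONOMICS", "MANAGEMENT", 60), ("SOCIOLOGY", "MANAGEMENT", 40),
    ("PHYSICS", "ECONOMICS", 5),   # weak cross-domain link
])

# Each yielded partition corresponds to one pass (level) of the multi-level aggregation.
for level, partition in enumerate(nx.community.louvain_partitions(G, weight="weight", seed=42)):
    q = nx.community.modularity(G, partition, weight="weight")
    print(f"level {level}: {len(partition)} clusters, modularity={q:.3f}")

# Convenience call returning only the final (highest-modularity) partition.
best = nx.community.louvain_communities(G, weight="weight", seed=42)
print("final clusters:", best)
```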

The number of subject assignments in the Web of Science (SCIE, SSCI, AHCI) is 14,608 for 9487 journals during 2002–2006, that is, roughly 1.54 categories per journal. The average number of journals per category is 59.4.

Taking into account the large share of multiply assigned journals, as well as the large share of journal self-citations, this aggregation will inevitably affect or even distort the real network among categories. To avoid this latent distortion, we decided to exclude all journal self-citation data before building the aggregated category–category citation matrix. In other words, our category-to-category cross-citation matrix is aggregated from citations among different journals only.

In order to measure in how far references/citations are spread among other journals, Zhang, Glänzel, et al. (2009) have introduced the indicator of entropy. Here we used the same indicator to measure the distribution of links among different categories.
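
As a rough illustration of such an entropy indicator (the exact formula of Zhang, Glänzel, et al., 2009 is not reproduced in this excerpt), the sketch below computes the Shannon entropy of a category's citation-link distribution over other categories; the category names and counts are hypothetical.

```python
import math

def link_entropy(link_weights):
    """Shannon entropy (in bits) of a category's citation links to other categories.
    `link_weights` maps target category -> number of citation links; higher entropy
    means the links are spread more evenly over many categories."""
    total = sum(link_weights.values())
    probs = [w / total for w in link_weights.values() if w > 0]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical link counts: a broadly connected category versus a concentrated one.
broad = {"SOCIOLOGY": 50, "ECONOMICS": 45, "PSYCHOLOGY": 55, "EDUCATION": 40, "LAW": 60}
narrow = {"PHYSICS": 230, "MATH": 15, "ENGINEERING": 5}
print(round(link_entropy(broad), 2), round(link_entropy(narrow), 2))  # broad > narrow
```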

Table 1 shows the top 10 categories with highest entropies, where 9 of them are assigned to arts and humanities. This is not surprising as it is well-known that the arts and humanities tend to communicate with a large scope of different categories.

By way of contrast, we present the top 10 categories with the highest entropies after the exclusion of arts and humanities (see Table 2). Social sciences categories occupy a big share; computer science, interdisciplinary applications has the highest entropy among the science categories. This result is in accordance with Leydesdorff and Rafols (2009), who concluded that computer science, interdisciplinary applications is the most interdisciplinary of all science categories, although they used another indicator: betweenness centrality.

In contrast to entropy, which measures the distribution of links within the communication network, the self-link index mainly represents the degree of isolation (see Eq. (5)).

The 10 most isolated categories are presented in Table 3, where the top three are, respectively, astronomy and astrophysics, mathematics and law. The striking SLI values may indicate a high degree of specialisation, or the particular citation characteristics of these categories.
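
Eq. (5) itself is not reproduced in this excerpt. Assuming the self-link index is essentially the share of a category's citation links that stay within the category (only one plausible reading, not the paper's exact definition), a minimal sketch with hypothetical figures would be:

```python
def self_link_index(cross_citation_row, category):
    """Hedged approximation of a self-link index: the fraction of a category's
    citation links that point back to the same category (its 'isolation').
    `cross_citation_row` maps cited category -> citation count for one citing category."""
    total = sum(cross_citation_row.values())
    return cross_citation_row.get(category, 0) / total if total else 0.0

row = {"MATH": 820, "PHYSICS": 90, "COMPUTER SCIENCE": 60, "STATISTICS": 30}  # hypothetical
print(round(self_link_index(row, "MATH"), 2))   # 0.82 -> highly specialised / isolated
```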

In the cross-citation network, the categories either spread their information over, and/or collect information from, a variety of other categories regardless of intensity (as in Table 2), or have strong influence from/on some particular categories but are relatively weak in expanding their communication scope (as in Table 5). In general, social sciences categories are inclined to enlarge their link distributions, while science categories tend to have more intense links.

The categories shown in Table 5 could be considered as “central nodes” among the whole communication network. These “central” actors would form some coherent sub-clusters in the network, and act as “cores” in these clusters. It is worthwhile to have a look at those sub-clusters, where there are dense information communications.

There are indeed structural differences between the elaborate clustering of more than 8000 journals and the clustering of the ISI subject categories.
The former clustering results are generated automatically based on the journal-to-journal similarities, and are then labelled using the best TF-IDF terms from all documents under study in these individual journals; while in the clustering based on ISI subject categories, we first assign all the individual journals into different categories according to the ISI assignment and then aggregate all the journal-to-journal citation data to category-to-category citation data. The clustering is thus analyzed at the category level, and the labelling for each cluster is based on the names of ISI categories included.
Therefore, the two clustering results provide two subject classification schemes through different perspectives and levels. The two classifications are structurally comparable but differences indeed exist. The divergence between the two structures may be due to the interferences from the multiple journal assignment to ISI subject categories, and on the other hand, may also reflect some possible improvement of the journal assignment scheme in ISI.

2013年4月15日 星期一

McCain, K. W. (2008). Assessing an author's influence using time series historiographic mapping: The oeuvre of Conrad Hal Waddington (1905–1975). Journal of the American Society for Information Science and Technology, 59(4), 510-525.

McCain, K. W. (2008). Assessing an author's influence using time series historiographic mapping: The oeuvre of Conrad Hal Waddington (1905–1975). Journal of the American Society for Information Science and Technology, 59(4), 510-525.

This study uses the HistCite™ software (Garfield, Pudovkin, & Istomin, 2003a, 2003b; see also Garfield, 2004) to produce historiographs in which Conrad Hal Waddington's oeuvre, the literature citing it, and the network formed by their citation linkages are arranged in order of publication date, and uses NetDraw's Newman–Girvan algorithm (Girvan & Newman, 2002) to identify subgroups of nodes with high betweenness in the network. This approach preserves the context of the state of research at particular points in time and reveals coherent research streams, so that the diffusion of research ideas and methods can be traced over time and across knowledge domains.

information visualization

A modified approach to algorithmic historiography is used to investigate the changing influence of the work of Conrad Hal Waddington over the period 1945–2004.

Much of this “citations as-indicators” work falls under the heading of evaluative bibliometrics and science & technology indicators (see, e.g., Moed, 2005).

Citation data can also be used to trace the diffusion of ideas and methodologies over time and across knowledge domains. They offer a window through which to view the development, growth, and decline of research fields and the impact or influence of both individual works and individual researchers, represented as bodies of cited work (oeuvres; see, e.g., Börner, Chen, & Boyack, 2003; De Mey, 1992; White & McCain, 1989, 1997).

There are a number of different approaches to creating and visualizing longitudinal citation networks based on large document collections—examples include mapping of cocitation cluster strings (Small, 1977; Small, 2006; Small & Greenlee, 1986; Small & Greenlee, 1989), textual and cocitation data mining (Chen, 2006), variable timelines based on core documents (Morris, Yen, Sheng, & Asnake, 2004), and main path analysis in temporal citation networks (Carley, Hummon, & Harty, 1993; Hummon & Doreian, 1989; Hummon, Doreian, & Freeman, 1990). Related to the last approach is the general idea of algorithmic historiography, supported by the HistCite™ software (Garfield, Pudovkin, & Istomin, 2003a, 2003b; see also Garfield, 2004).

The associated historiographs are acyclic graphs—visualizations of the citation linkages among a subset of the documents that are all cited above a researcher-specified citation threshold. As in main path analysis, the networked documents are arranged in temporal publication sequence (older to younger) to show how research streams develop as newer work incorporates the information in older cited work and is, in turn, cited itself.
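
As a hedged sketch of this kind of temporal citation layout (not HistCite™'s own rendering), the snippet below builds a small hypothetical citation network as a directed acyclic graph and orders the nodes by publication year; the paper labels and years are made up.

```python
import networkx as nx

# Hypothetical cited oeuvre and citing papers, keyed by label with publication year.
year = {"W1942": 1942, "W1957": 1957, "A1975": 1975, "B1988": 1988, "C1996": 1996}
citations = [("W1957", "W1942"), ("A1975", "W1942"), ("A1975", "W1957"),
             ("B1988", "A1975"), ("C1996", "A1975"), ("C1996", "B1988")]

G = nx.DiGraph(citations)                    # edges point from citing work to cited work
assert nx.is_directed_acyclic_graph(G)       # citation links can only point back in time

# Temporal publication sequence, older to younger, as in a historiograph layout.
for paper in sorted(G.nodes(), key=lambda p: year[p]):
    cited = sorted(G.successors(paper), key=lambda p: year[p])
    print(year[paper], paper, "cites:", ", ".join(cited) if cited else "-")
```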

Published analyses using HistCite™ include a range of topics in information science (Garfield, 2004), networks associated with the discovery of DNA (Garfield, Pudovkin, & Istomin, 2003b), neural networks (McCain, 2005a), medical informatics (McCain & Silverstein 2006), and global environmental change (Janssen, Schoon, Ke, & Börner, 2006). In addition, Garfield has posted almost 400 analyzed document sets on his Web site (Garfield, 2006).

The challenge in tracing change over time bibliometrically is to retain the ability to see the world as it was at some particular time in the past and not just cumulatively, from the perspective of today.

Timeline visualization, main path analysis, and historiographic mapping have typically been applied to aggregate topical or journal-bound data sets with the temporal aspect being provided by the arrangement of the highly cited works.

The cumulative aspect of historiographic mapping becomes a particular problem if one wants to focus on a particular author rather than a broad topic or journal run as the basis for building the document set. The data set to be studied would, naturally, focus on the author’s oeuvre and the literature that cites it. Thus, the publications cited above any useful citation threshold will primarily be those of the author, with only the most prominent non-oeuvre-based citing articles visible in the network.

In this article, I demonstrate one way that historiographic mapping using Garfield’s HistCite™ software can be adapted to capture the impact of an author’s work. It preserves the context of the state of research at various points in time and allows identification of coherent research streams within a larger citing/cited network. This is accomplished by modifying and enhancing the data input, focusing on separate, successive periods, and using social network analysis software to build and analyze the historiograph.

Within-network research themes were identified using NetDraw's Newman–Girvan algorithm. This algorithm (Girvan & Newman, 2002) identifies subgroups of nodes within a network that have "high-betweenness centrality" (see Chen, 2006).

Waddington’s work is differentially cited by two distinct research communities. In Figure 1, it can be seen that on the left is a single, densely connected subnetwork focusing on canalization/genetic assimilation. To the center and the right, there are four subnetworks relating to experimental embryology and embryonic induction: amphibian embryology, theoretical models and experimental embryology, embryology of Drosophila, and embryonic induction.

2013年4月14日 星期日

Wallace, M. L., Gingras, Y., & Duhon, R. (2009). A new approach for detecting scientific specialties from raw cocitation networks. Journal of the American Society for Information Science and Technology, 60(2), 240-246.

Wallace, M. L., Gingras, Y., & Duhon, R. (2009). A new approach for detecting scientific specialties from raw cocitation networks. Journal of the American Society for Information Science and Technology, 60(2), 240-246.

network analysis

This paper evaluates the feasibility of applying the Blondel community detection algorithm (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008) to author cocitation analysis, using the topology of the network to find likely clusters of nodes and thereby identify the specialties within a scholarly field. The method addresses four requirements of current author cocitation analysis: 1) it can exploit the different weights on the links between authors; 2) it determines the number of communities automatically with a single optimization function; 3) it judges the community detection result with modularity, without restrictions on the size or topology of the network; and 4) it partitions the network using only its inherent structure, without altering the network beforehand. Because nodes in the same cluster of a cocitation network have higher cocitation counts with one another, the authors consider modularity an appropriate measure of the community structure of cocitation networks. The Blondel algorithm is chosen as the community detection algorithm for two reasons: its short running time and its sensitivity to local structure. To verify the feasibility of the approach, two author cocitation data sets are fed into the Blondel algorithm: one consists of cocitation data for 12 highly cited authors each from information retrieval and from bibliometrics research, and the other of cocitation data for the top 100 highly cited authors across all of science.

There is an increasingly large amount of literature devoted to the treatment of cocitation data, either of papers, authors, or journals. Most of these studies use this readily available information to map the structure of science or identify different clusters of scientific research. The idea behind this type of work, initially developed and used by H. Small and others (Bayer, Smart, & McLaughlin, 1990; Marshakova, 1973; Small, 1973; Small & Griffith, 1974; Small & Sweeney, 1985; White, 1981; White & McCain, 1981) is to use cocitations as the foundation of a conceptual network that evolves in time based on the choices (i.e., citation practices) of scientists themselves (Small, 1978).

A recent debate on the appropriate similarity measures to evaluate the “proximity” of agents (Ahlgren, Jarneving, & Rousseau, 2003, 2004; Bensman, 2004; Leydesdorff, 2008; Leydesdorff & Vaughan, 2006; White, 2003) highlights the need for an alternative methodology for detecting local research communities corresponding to scientific specialties, preferably without having to map the data onto another vector space.

In this article, we evaluate a new community detection method (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008) used for identifying scientific specialties in any given cocitation network, without any free parameters or pre- or postprocessing of the data.

Our new approach is motivated by four requirements with respect to the clustering of cocitation networks.

First, the weight of the links between authors (no. of cocitations) is crucial; this is where most of the information is contained. Therefore, any network-based approach must be able to take into account not only the existence of links between authors but also how strong these links are.

Second, there should be no “choice” made by the user regarding which clusters to identify nor should there be any a priori limitations as to the number of communities or to their population; a single optimization function or algorithm should provide an independent division of the network.

Third, aside from the case of extremely large networks, there should be no restrictions on the size or topology of the network used. Naturally, some networks have a more clear-cut community structure than do others, and this should be apparent or quantifiable. In our case, the modularity of the decomposed networks provides a good indicator of this.

Finally, there should not be any a priori assumptions on the networks themselves. In other words, they need not be altered in any way before applying the algorithm and only their inherent structure should be used to determine how they should be partitioned.

The Girvan–Newman (GN) algorithm (Girvan & Newman, 2002; Newman & Girvan, 2004) is well-known as the canonical method for community detection in complex networks. This method has recently been successfully applied to identify research themes within a citation network (McCain, 2008). Essentially, the algorithm consists in cutting links with high values of betweenness (in terms of geodesics passing through a link) and monitoring the graph’s modularity Q, loosely defined as a measure of how meaningful a given division of the network into subgroups is, while taking into account the number of random links that would be expected within a subgroup.
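
A minimal sketch of this divisive procedure, assuming an unweighted toy network and NetworkX's implementation, monitors modularity over successive divisions and keeps the best one; as the authors note below, handling weighted cocitation links requires more than this default setup.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Toy network: two dense groups joined through a single bridging node
# (cocitation weights are ignored here, which is exactly the limitation noted below).
G = nx.barbell_graph(5, 1)

best_q, best_partition = -1.0, None
for partition in girvan_newman(G):        # successive divisions as high-betweenness edges are cut
    q = modularity(G, partition)
    if q > best_q:
        best_q, best_partition = q, partition

print(round(best_q, 3), [sorted(c) for c in best_partition])
```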

Modularity is not an appropriate measure for community structures in all networks. ... Cocitation networks, though, are well-suited to the measure. Even a negative citation alongside a positive citation implies some topical similarity between the two articles.

However, a standard implementation of the GN algorithm for weighted cocitation networks is not straightforward and is extremely expensive in computational time.

The GN algorithm successfully identifies communities on the periphery of the network, but almost never cuts “heavy” links (i.e., high numbers of cocitations), even though this is occasionally necessary to bring out the communities inherent in the network. In real, relatively compact cocitation networks with “strong” (but divisible) cores where practically everyone is co-cited with everyone at some point in time, an algorithm organized in this way can be of some heuristic value, but will have great difficulty uncovering the optimal community structure.

The algorithm of Blondel et al. (2008) balances optimization of modularity with running time and sensitivity to local structure.

Each node first is placed in separate communities. Iterating over all nodes, one checks if moving the node from its current community to any community to which a neighbor belongs would yield an increase in modularity. If so, one moves the node to the neighboring community that gives the highest increase in modularity and continues the process until equilibrium is reached.

Then, one projects each community as a single node in a new network, with edges between community-nodes where there were edges between nodes in the communities in the original network. The weights of the new edges are obtained by summing over all previous weights (including self-loops).

Finally, the entire process is repeated until there is no change in the community structure. ... The result is a hierarchy of communities for the network.
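
As an illustration of the first, local-moving phase described above, the following naive sketch (not the authors' implementation) moves nodes between neighbouring communities as long as modularity increases; for clarity it recomputes modularity from scratch rather than using the incremental gain that makes the real algorithm fast.

```python
import networkx as nx
from networkx.algorithms.community import modularity

def local_moving_phase(G, weight="weight"):
    """Naive sketch of the local-moving phase: every node starts in its own community,
    and a node is moved to a neighbouring community whenever the move increases
    modularity, until no move helps."""
    community_of = {node: i for i, node in enumerate(G.nodes())}

    def as_partition():
        groups = {}
        for node, c in community_of.items():
            groups.setdefault(c, set()).add(node)
        return list(groups.values())

    improved = True
    while improved:
        improved = False
        for node in G.nodes():
            current = community_of[node]
            best_c, best_q = current, modularity(G, as_partition(), weight=weight)
            for neighbour in G.neighbors(node):
                candidate = community_of[neighbour]
                if candidate == current:
                    continue
                community_of[node] = candidate          # tentative move
                q = modularity(G, as_partition(), weight=weight)
                if q > best_q:
                    best_c, best_q = candidate, q
            community_of[node] = best_c                  # keep the best move (or stay)
            if best_c != current:
                improved = True
    return as_partition()

G = nx.barbell_graph(4, 0)           # two 4-cliques joined by a single edge
print(local_moving_phase(G))         # the two cliques emerge as the communities
```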

While many other methods discussed can be successfully applied to networks that are small, or where the communities are fairly clear-cut, we believe that a rigorous utilization of cocitation data generally results in much more dense or convoluted networks, and thus requires a more robust approach.

Furthermore, we believe that it is imperative that subjective treatment of the data be avoided as much as possible in cocitation analysis.

These techniques could be of great use to historians or sociologists of science, by tracking the emergence, demise, proximity, or fusion of specializations as well as the evolution of scientific paradigms (Chubin, 1976; Mullins, 1972; Mullins et al., 1977; Small, 2006). Given a specific community, we can identify—using keywords for instance—its ideas, methods, and membership.

Waltman, L., van Eck, N. J., & Noyons, E. (2010). A unified approach to mapping and clustering of bibliometric networks. Journal of Informetrics, 4(4), 629-635.

Waltman, L., van Eck, N. J., & Noyons, E. (2010). A unified approach to mapping and clustering of bibliometric networks. Journal of Informetrics, 4(4), 629-635.

information visualization

This study proposes a unified technique for the mapping and clustering of bibliometric networks. Mapping and clustering techniques are often used together to analyse the structure of bibliometric networks, in order to understand the main research topics of a scientific field, the relations among those topics, and the development of the field. In previous work on bibliometric networks, some studies construct a map showing the individual nodes of the network and display a clustering of the nodes on top of the map, for example McCain (1990), White & Griffith (1981), Leydesdorff & Rafols (2009), and Van Eck, Waltman, Dekker, & Van den Berg (in press); others first cluster the nodes and then construct a map showing the clusters, for example Small, Sweeney, & Greenlee (1985) and Noyons, Moed, & Van Raan (1999); a third approach first constructs a map showing the individual nodes and then clusters the nodes based on their coordinates in the map, for example Boyack, Klavans, & Börner (2005) and Klavans & Boyack (2006).

In bibliometric and scientometric research, the most commonly used combination is multidimensional scaling with hierarchical clustering; well-known early examples include McCain (1990), Peters & Van Raan (1993), Small et al. (1985), and White & Griffith (1981). Other well-known mapping techniques include the algorithm of Kamada and Kawai (1989), often used together with pathfinder network scaling, as in Chen (1999), de Moya-Anegón et al. (2007), and White (2003); VxOrd, proposed by Boyack and colleagues (Boyack et al., 2005; Klavans & Boyack, 2006); and VOS, proposed by Van Eck and colleagues (Van Eck et al., in press). For clustering, besides hierarchical clustering, factor analysis is also frequently used, for example in de Moya-Anegón et al. (2007), Leydesdorff & Rafols (2009), and Zhao & Strotmann (2008). In recent years, clustering techniques built on the modularity function of Newman and Girvan (2004) have been widely applied in bibliometric and scientometric research, for example Chen & Redner (2010), Lambiotte & Panzarasa (2009), Schubert & Soós (2010), Takeda & Kajikawa (2009), Wallace, Gingras, & Duhon (2009), and Zhang, Liu, Janssens, Liang, & Glänzel (2010).

As this overview shows, however, although mapping and clustering techniques are closely related and are often used together to analyse the structure of bibliometric networks, they have largely been developed independently of each other. Motivated by this problem, the present study unifies the two by deriving the previously developed VOS mapping technique and a modularity-based clustering technique from the same underlying principle, thereby establishing a relation between them. Another study that integrates mapping and clustering, Noack (2009), defines a parameterized objective function that describes a class of mapping techniques and shows that modularity-based clustering can also be subsumed under this objective function, thereby relating mapping and clustering techniques. The present study differs from Noack (2009) in that it establishes a direct relation between the VOS mapping technique and modularity-based clustering rather than relating them through a common objective function, in that it includes a weighing factor, and in that it uses a resolution parameter to address the resolution limit of modularity-based clustering.

To demonstrate the feasibility of the technique, the study maps and clusters the 1242 most frequently cited publications in information science from 1999 to 2008, estimating the relatedness of publications from the sum of their bibliographic coupling and co-citation counts. The resulting map shows that the structure of information science contains two large subfields, information seeking and retrieval and informetrics, which is consistent with other bibliometric studies of information science.

In bibliometric and scientometric research, a lot of attention is paid to the analysis of networks of, for example, documents, keywords, authors, or journals. Mapping and clustering techniques are frequently used to study such networks. The aim of these techniques is to provide insight into the structure of a network. The techniques are used to address questions such as:
• What are the main topics or the main research fields within a certain scientific domain?
• How do these topics or these fields relate to each other?
• How has a certain scientific domain developed over time?
To satisfactorily answer such questions, mapping and clustering techniques are often used in a combined fashion.

One approach is to construct a map in which the individual nodes in a network are shown and to display a clustering of the nodes on top of the map, for example by marking off areas in the map that correspond with clusters (e.g., McCain, 1990; White & Griffith, 1981) or by coloring nodes based on the cluster to which they belong (e.g., Leydesdorff & Rafols, 2009; Van Eck, Waltman, Dekker, & Van den Berg, in press).

Another approach is to first cluster the nodes in a network and to then construct a map in which clusters of nodes are shown. This approach is for example taken in the work of Small et al. (e.g., Small, Sweeney, & Greenlee, 1985) and in earlier work of our own institute (e.g., Noyons, Moed, & Van Raan, 1999).

A third approach is to first construct a map in which the individual nodes in a network are shown and to then cluster the nodes based on their coordinates in the map (e.g., Boyack, Klavans, & Börner, 2005; Klavans & Boyack, 2006).

In the bibliometric and scientometric literature, the most commonly used combination of a mapping and a clustering technique is the combination of multidimensional scaling and hierarchical clustering (for early examples, see McCain, 1990; Peters & Van Raan, 1993; Small et al., 1985; White & Griffith, 1981).

A popular alternative to multidimensional scaling is the mapping technique of Kamada and Kawai (1989) (see, e.g., Leydesdorff & Rafols, 2009; Noyons & Calero-Medina, 2009), which is sometimes used together with the pathfinder network technique (Schvaneveldt, Dearholt, & Durso, 1988; see, e.g., Chen, 1999; de Moya-Anegón et al., 2007; White, 2003). Two other alternatives to multidimensional scaling are the VxOrd mapping technique (e.g., Boyack et al., 2005; Klavans & Boyack, 2006) and our own VOS mapping technique (e.g., Van Eck et al., in press).

Factor analysis, which has been used in a large number of studies (e.g., de Moya-Anegón et al., 2007; Leydesdorff & Rafols, 2009; Zhao & Strotmann, 2008), may be seen as a kind of clustering technique and, consequently, as an alternative to hierarchical clustering. Another alternative to hierarchical clustering is clustering based on the modularity function of Newman and Girvan (2004) (see, e.g., Wallace, Gingras, & Duhon, 2009; Zhang, Liu, Janssens, Liang, & Glänzel, 2010).

In bibliometric and scientometric research, modularity-based clustering has been used in a number of recent studies (Chen & Redner, 2010; Lambiotte & Panzarasa, 2009; Schubert & Soós, 2010; Takeda & Kajikawa, 2009; Wallace et al., 2009; Zhang et al., 2010).

As we have discussed, mapping and clustering techniques have a similar objective, namely to provide insight into the structure of a network, and the two types of techniques are often used together in bibliometric and scientometric analyses. However, despite their close relatedness, mapping and clustering techniques have typically been developed separately from each other.

In our view, when a mapping and a clustering technique are used together in the same analysis, it is generally desirable that the techniques are based on similar principles as much as possible. This enhances the transparency of the analysis and helps to avoid unnecessary technical complexity. Moreover, by using techniques that rely on similar principles, inconsistencies between the results produced by the techniques can be avoided.

In this paper, we propose a unified approach to mapping and clustering of bibliometric networks. We show how a mapping and a clustering technique can both be derived from the same underlying principle. In doing so, we establish a relation between on the one hand the VOS mapping technique (Van Eck & Waltman, 2007; Van Eck et al., in press) and on the other hand clustering based on a weighted and parameterized variant of the well-known modularity function of Newman and Girvan (2004).

It follows from (6) and (7) that our proposed clustering technique can be seen as a kind of weighted variant of modularity-based clustering (see Appendix B for a further discussion). However, unlike modularity-based clustering, our clustering technique has a resolution parameter. This parameter helps to deal with the resolution limit problem (Fortunato & Barthélemy, 2007) of modularity-based clustering. Due to this problem, modularity-based clustering may fail to identify small clusters. Using our clustering technique, small clusters can always be identified by choosing a sufficiently large value for the resolution parameter.
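
To illustrate what a resolution parameter does in practice, the sketch below uses plain modularity-based Louvain clustering in NetworkX as a stand-in (not the paper's weighted VOS-related variant) on a ring of small cliques, a standard example of the resolution limit: at low resolution adjacent cliques may be merged, while a larger resolution value recovers each clique as its own cluster.

```python
import networkx as nx

# Ring of 16 cliques of 4 nodes each; at resolution 1, modularity optimization
# may merge neighbouring cliques because of the resolution limit.
G = nx.ring_of_cliques(num_cliques=16, clique_size=4)

for gamma in (0.5, 1.0, 2.0):
    clusters = nx.community.louvain_communities(G, resolution=gamma, seed=7)
    print(f"resolution={gamma}: {len(clusters)} clusters")
```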

The above result showing how mapping and clustering can be performed in a unified and consistent way resembles to some extent a result derived by Noack (2009). Noack defined a parameterized objective function for a class of mapping techniques (referred to as force-directed layout techniques by Noack). This class of mapping techniques includes for example the well-known technique of Fruchterman and Reingold (1991). Noack showed that his parameterized objective function subsumes the modularity function of Newman and Girvan (2004). In this way, Noack established a relation between on the one hand a class of mapping techniques and on the other hand modularity-based clustering.

First, the result of Noack does not directly relate well-known mapping techniques such as the one of Fruchterman and Reingold to modularity-based clustering. Instead, Noack’s result shows that the objective functions of some well-known mapping techniques and the modularity function of Newman and Girvan are special cases of the same parameterized function. Our result establishes a direct relation between a mapping technique that has been used in various applications, namely the VOS mapping technique, and a clustering technique.

Second, the mapping and clustering techniques considered by Noack and the ones that we consider differ from each other by a weighing factor. This is the weighing factor given by (7).

Third, the clustering technique considered by Noack is unparameterized, while our clustering technique has a resolution parameter.

In Fig. 1, we show a combined mapping and clustering of the 1242 most frequently cited publications that appeared in the field of information science in the period 1999–2008. The mapping and the clustering were produced using our unified approach.

For these publications, we determined the number of co-citation links and the number of bibliographic coupling links. These two types of links were added together and served as input for both our mapping technique and our clustering technique.
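
A hedged sketch of how such link counts can be obtained from a document-by-document citation matrix (the paper's actual data handling is not reproduced here): with C[i][j] = 1 when document i cites document j, bibliographic coupling counts are C·Cᵀ, co-citation counts are Cᵀ·C, and the two are summed element-wise.

```python
import numpy as np

# Hypothetical citation matrix: C[i, j] = 1 if document i cites document j.
C = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
])

bibliographic_coupling = C @ C.T      # shared references between citing documents
cocitation = C.T @ C                  # number of times two documents are cited together
np.fill_diagonal(bibliographic_coupling, 0)
np.fill_diagonal(cocitation, 0)

# The relatedness used as input to mapping and clustering is the sum of the two.
relatedness = bibliographic_coupling + cocitation
print(relatedness)
```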

The combined mapping and clustering shown in Fig. 1 provides an overview of the structure of the field of information science. The left part of the map represents what is sometimes referred to as the information seeking and retrieval (ISR) subfield (Åström, 2007), and the right part of the map represents the informetrics subfield.

The clustering shown in Fig. 1 consists of 25 clusters. The distribution of the number of publications per cluster has a mean of 49.7 and a standard deviation of 31.5.

2013年4月8日 星期一

Rorissa, A. and Yuan, X. (2012). Visualizing and mapping the intellectual structure of information retrieval. Information Processing and Management, 48, 120-135.

Rorissa, A. and Yuan, X. (2012). Visualizing and mapping the intellectual structure of information retrieval. Information Processing and Management, 48, 120-135.

network analysis

Chen (2006) distinguishes between the intellectual base and the research front of a scholarly field or discipline: the research front is the state of the art of a specialty, and what the research front cites forms its intellectual base. Analyses of the intellectual base of a field or discipline usually rely on journal citation data and visualize the citation data with cluster analysis, multidimensional analysis, and other techniques (e.g., Chen & Kuljis, 2003; Chen et al., 2010; Ding et al., 2000; McKechnie, Goodall, Lajoie-Paquette, & Julien, 2005; Tang, 2004; White & Griffith, 1981; White & McCain, 1998). White (2003) used Pathfinder Networks to redraw the data of White & McCain (1998) as a science map of library and information science. These techniques have also been built into software tools that are freely available to researchers, such as CiteSpace. The growing interest in mapping citation data into science maps can be attributed to the wide availability of citation data sources and other data, the free availability of many computer applications for visualization and mapping, and the need for inclusive and easy-to-use methods for managing and making sense of ever-growing amounts of electronic data.

This study carries out a series of analyses and visualizations of citation data in the information retrieval field for 2000–2009. The data source is the Web of Science, from which 48,390 bibliographic records were used, and the visualization tool is CiteSpace (http://cluster.cis.drexel.edu/cchen/citespace/), developed by Chen (2004a, 2004b, 2006). The analyses and results are as follows:

1) Co-authorship network: the analysis covers the 32 more productive authors who published at least 10 papers, half from computer science and half from information science. A link is created between the nodes of two authors whenever they have appeared together on a paper. Once the network is built, the betweenness of each author's node is measured to identify the central authors. Two larger clusters containing at least eight authors each can be found, anchored respectively by Jarvelin K and Chen HC, the former from information science and the latter from computer science. The study infers that the more productive an author is, the more co-authors he or she has.
2) Highly cited journals and papers: about 43% of the highly cited documents were published between 1970 and 1989, indicating that information retrieval is already a fairly mature field. In addition, the link between the most productive authors and the highly cited documents is weak. The most cited publications and authors all come from computer science; the three most cited authors and their shares of all citations are Salton (26.1%), Jansen (9.88%), and Baeza-Yates (8.38%).

3) Author-assigned keywords: in the co-occurrence network formed by the keywords, the keywords with higher betweenness, besides information retrieval, include information seeking, information system, evaluation, and user studies. Four main research clusters formed by these terms are clearly visible on the map: 1) user studies, 2) Web information retrieval, 3) citation analysis/scientometrics, and 4) information retrieval system evaluation.
4) Active institutions: most of the main research institutions in information retrieval are located in the United States, and the great majority are academic institutions or universities. Mapping the collaboration among these institutions from the authors' co-authorship relations shows that the field strongly encourages cross-institutional and cross-national research.

5) The import of ideas from other disciplines: the citation relations of the papers show that ideas in information retrieval research come mainly from five disciplines: computer science, library and information science, engineering, telecommunications, and management; as much as 91.6% of the citations go to these five disciplines.
We analyzed citation data in the information retrieval subfield for the past decade (2000–2009) and presented the results in terms of co-authorship network, highly productive authors, highly cited journals and papers, author-assigned keywords, active institutions, and the import of ideas from other disciplines.
An indicator of the maturity of an area of inquiry is the growth in the number and quality of research publications (Van den Beselaar & Leydesdorff, 1996). Insight into the nature of a field can be gained by examining ‘‘the publications produced by its practitioners. To the extent that practitioners in the field publish the results of their investigations, this mode for assessing the state of a field can reflect with great specificity the content and problem orientations of the group. Of the many ways that publications can be analyzed and counted, perhaps the most revealing kind of data are the references cited by the practitioner group in their publications’’ (Small, 1981, p. 39).
A field, discipline, or other area of study can be broadly divided into an intellectual base and current research fronts (Chen, 2006). ‘‘If we define a research front as the state of the art of a specialty (i.e., a line of research), what is cited by the research front forms its intellectual base’’ (Chen, 2006, p. 360). Previous literature of a discipline cited in its current publications (i.e., its intellectual base) can inform us about current research fronts. It is through references to sources that authors make connections between concepts (Small, 1981). Collectively, such connections create a ‘‘representation of the cognitive structure of the research field’’ (Small, 1981, p. 39).
Although a number of studies have examined the literature of library and information science (e.g., Åström, 2007, 2010; Chen, Ibekwe-SanJuan, & Hou, 2010; Cronin & Meho, 2007; Donohue, 1972; Harter & Hooten, 1992; Harter, Nisonger, & Weng, 1993; Persson, 1994; Rice, 1990; White & McCain, 1998; Zhao & Strotmann, 2008a, 2008b), the nature of the literature concerning information retrieval has not been so thoroughly investigated (Ding, Chowdhury, & Foo, 2000; Ding, Yan, Frazho, & Caverlee, 2009; Ellis, Allen, & Wilson, 1999).
According to Chen (2006), the trends and patterns of scientific literatures provide researchers or communities of similar interests an overview of the related field(s) and relationships among the specific fields. More specifically, such information as the most influential articles or books, the evolvement of terms, noun phrases, keywords, the most reputed researchers, the connection between different institutions and countries over time can show trends and patterns that provide more overview.
The most prominent conclusions address the stable, multidisciplinary nature of the field. For instance, Persson (1994) found that, for a 5-year period (1986–1990), the intellectual base of the Journal of the American Society for Information Science (consisting of the most frequently co-cited authors) and its research front (consisting of articles sharing at least five cited authors) had similar maps (depicted in two-dimensional spaces). This finding highlights the stable nature of the topics explored by the information studies field.
Ding et al. (2000) conducted one of the earliest studies on the literature of information retrieval specifically. They analyzed co-citation data of 50 highly cited journals using multidimensional scaling and cluster analysis. They produced two-dimensional maps of the structure of the literature of information retrieval for an 11-year period (1987–1997). The visualizations revealed strong relationships between information retrieval and five other disciplines: computer science, physics, chemistry, psychology, and neuroscience. Their maps also show information retrieval as being part of both the computer science and LIS fields.
Tang (2004) identified the most common disciplines to which LIS exports ideas (based on the number of citations it received from those disciplines): computer science, communication, education, management, business, and engineering. Another study looked at the export and import of ideas to and from LIS and found it to be an exporter of ideas (Cronin & Meho, 2008). That contrasts sharply with the state of the field 20 years ago, when few researchers from other disciplines cited LIS literature (Cronin & Pearson, 1990).
Results
Analysis of authorship and co-authorship is critical to the understanding of scholarly communication and knowledge diffusion (Chen, 2006).
To show the extent of collaboration by the most productive researchers in our dataset, we used CiteSpace to create a coauthorship network map. Two researchers have co-authorship if they have co-authored at least one paper together. CiteSpace generates networks by measuring betweenness centrality scores. ‘‘In a network, the betweenness centrality of a node measures the extent to which the node plays a role in pulling the rest of the nodes in the network together. The higher the centrality of a node, the more strategically important the node is.’’ (Chen et al., 2009, p. 236).
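
As a hedged illustration of betweenness centrality on a co-authorship-style network (not CiteSpace's own computation; the author labels are made up), the sketch below ranks nodes by how often they lie on shortest paths between other nodes.

```python
import networkx as nx

# Hypothetical co-authorship network: two collaboration groups linked by one bridge.
G = nx.Graph()
G.add_edges_from([
    ("A", "B"), ("A", "C"), ("B", "C"),        # first collaboration group
    ("D", "E"), ("D", "F"), ("E", "F"),        # second collaboration group
    ("C", "D"),                                 # single bridging collaboration
])

centrality = nx.betweenness_centrality(G, normalized=True)
for author, score in sorted(centrality.items(), key=lambda x: -x[1]):
    print(author, round(score, 3))   # C and D bridge the two groups and score highest
```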
Two of the author co-citation clusters are large enough to contain at least eight members/authors. Two highly productive authors (see Table 1) anchor the two clusters: Jarvelin K and Chen HC. ... These results point to a relation between collaboration frequency and the most productive authors. We are tempted to conclude that the more actively an author collaborates, the more productive she or he is. Further research is necessary to confirm this assertion.

Chen, Cribbin, Macredie, and Morar (2002) showed that visualization can be used to track the development of a scientific discipline and present the long-term process of its competing paradigms. They also assert that, among a discipline’s co-cited publications, the cluster consisting of the most highly cited publications may represent the discipline’s core or predominant paradigm.
A product of CiteSpace, Fig. 2 displays a document co-citation network, generated from the collective citing behavior in our information retrieval dataset. The network is composed of 121 reference nodes and 1163 co-citation links.

Author-assigned keywords can reveal specific focus areas of research in a field.
As suggested by Table 4, ‘‘information retrieval’’ is located near the center of the cloud of keywords. Fig. 4 also indicates that other related terms have taken on a central role in the subfield: ‘‘information seeking,’’ ‘‘information system,’’ ‘‘evaluation,’’ and ‘‘user studies.’’ This highlights the rising emphasis on user-centered system design and retrieval, as well as the importance of user studies in the evaluation of IR systems. Information retrieval research is stronger today because it has increasingly focused on user-centered design. Current user studies research is more about ‘‘users’ interaction with information retrieval systems than about user information behavior in general’’ (Zhao & Strotmann, 2008a, p. 2077).
Fig. 4 also suggests that the information retrieval subfield has its own special areas of inquiry. Four main clusters can be discerned on the visualization map, centered around user studies, Web information retrieval, citation analysis/scientometrics, and information retrieval system evaluation.
Fig. 5 maps the collaboration between the top 20 institutions of information retrieval authors in our dataset. ... These diverse groupings indicate that the information retrieval subfield encourages collaboration across institutions and countries.
Information retrieval researchers in our dataset cite primarily computer science and library and information science publications (see Table 6). Those two fields account for 82.79% of the citations. ... Apart from LIS and computer science, the third, fourth, and fifth other disciplines from which information retrieval imports ideas are engineering, telecommunications, and management, respectively. In fact, 91.6% of the citations by information retrieval authors whose articles were published between 2000 and 2009 were to these five disciplines.