Monday, April 29, 2013

Van Eck, N. J., Waltman, L., Noyons, E. C., & Buter, R. K. (2010). Automatic term identification for bibliometric mapping. Scientometrics, 82(3), 581-596.

information visualization

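A term map visualizes the structure of a scientific field. Here, a term is a word or phrase that represents a domain-specific concept, and a term map is a graphic that shows the relations between the important terms in a field. To construct term maps, this study proposes a method for automatic term identification, which reduces expert labor and avoids the problems caused by subjective judgment. Because terms identified from a corpus must exhibit both unithood and termhood (Kageura and Umino, 1996), the proposed method consists of three steps. The first step uses the output of a part-of-speech tagger (Schmid, 1994; Schmid, 1995) to extract the noun phrases in the input corpus as candidate terms. The second step compares the frequency of occurrence of each candidate term with the frequencies of occurrence of its first word and of the remainder of the phrase, computes a likelihood ratio (Dunning, 1993) to assess the degree to which the candidate forms a complete semantic unit, and keeps the candidates with higher unithood. The third step, which computes the termhood of each term, is the main contribution of the study. Once the terms with sufficient unithood and termhood have been identified, the relation between each pair of terms is represented by their association strength (Van Eck and Waltman 2009), and the term map is produced using the VOS technique (Van Eck and Waltman 2007a).

The study proposes estimating the termhood of each term from the way the term is distributed over topics. A term with high termhood is one that is strongly associated with only one topic or a small number of topics. For each term, the study therefore compares the distribution of the term over the topics with the overall distribution of the topics; a large difference indicates high termhood. In other words, terms with high termhood have discriminatory power for a small number of topics. However, because any document may cover several topics, neither the distribution of a term over the topics nor the overall distribution of the topics can be obtained by simple counting. The study therefore uses probabilistic latent semantic analysis (PLSA) (Hofmann, 2001) to estimate the distribution of each term over the topics, compares it with the overall distribution of the topics, and identifies the terms whose distribution is biased toward a few topics.

Evaluating the results of automatic term identification is difficult (Pazienza et al., 2005). To evaluate its results, the study built a term map of the field of operations research from 15 journals in the corresponding ISI subject category and evaluated it in two ways. The first compares the recall and precision of the proposed method with those of two alternatives: term identification without PLSA, and term selection based on frequency of occurrence. The second is a quality assessment of the resulting term map by experts in operations research. The first evaluation shows that, except at the highest and lowest levels of recall, the proposed method achieves higher precision than the two alternatives. In the expert review, the term map was found to reflect the division of operations research into methodology-oriented and application-oriented research topics, which agrees well with the experts' views. However, the current results also reveal several problems: the method picks up terms with rather general meanings, some topics are missing from the map, and some closely related terms are not located close to each other in the map.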

A term map is a map that visualizes the structure of a scientific field by showing the relations between important terms in the field.

To evaluate the proposed methodology, we use it to construct a term map of the field of operations research. The quality of the map is assessed by a number of operations research experts.

Other maps show relations between words or keywords based on co-occurrence data (e.g., Rip and Courtial 1984; Peters and Van Raan 1993; Kopcsa and Schiebel 1998; Noyons 1999; Ding et al. 2001). The latter maps are usually referred to as co-word maps.

By a term we mean a word or a phrase that refers to a domain-specific concept. Term maps are similar to co-word maps except that they may contain any type of term instead of only single-word terms or only keywords.

Selection of terms based on their frequency of occurrence in a corpus of documents typically yields many words and phrases with little or no domain-specific meaning. Inclusion of such words and phrases in a term map is highly undesirable for two reasons. First, these words and phrases divert attention from what is really important in the map. Second and even more problematic, these words and phrases may distort the entire structure shown in the map.

However, manual term selection has serious disadvantages as well. The most important disadvantage is that it involves a lot of subjectivity, which may introduce significant biases in a term map. Another disadvantage is that it can be very labor-intensive.

Given a corpus of documents, we first identify the main topics in the corpus. This is done using a technique called probabilistic latent semantic analysis (Hofmann 2001). Given the main topics, we then identify in the corpus the words and phrases that are strongly associated with only one or only a few topics. These words and phrases are selected as the terms to be included in a term map.

An important property of the proposed methodology is that it identifies terms that are not only domain-specific but that also have a high discriminatory power within the domain of interest. This is important because terms with a high discriminatory power are essential for visualizing the structure of a scientific field.

We define unithood as the degree to which a phrase constitutes a semantic unit. Our idea of a semantic unit is similar to that of a collocation (Manning and Schütze 1999). Hence, a semantic unit is a phrase consisting of words that are conventionally used together. The meaning of the phrase typically cannot be fully predicted from the meaning of the individual words within the phrase.

We define termhood as the degree to which a semantic unit represents a domain-specific concept.

Linguistic approaches are mainly used to identify phrases that, based on their syntactic form, can serve as candidate terms.

Statistical approaches are used to measure the unithood and termhood of phrases.

Most terms have the syntactic form of a noun phrase (Justeson and Katz 1995; Kageura and Umino 1996). Linguistic approaches to automatic term identification typically rely on this property. These approaches identify candidate terms using a linguistic filter that checks whether a sequence of words conforms to some syntactic pattern. Different researchers use different syntactic patterns for their linguistic filters (e.g., Bourigault 1992; Dagan and Church 1994; Daille et al. 1994; Justeson and Katz 1995; Frantzi et al. 2000).

Statistical approaches to measure unithood are discussed extensively by Manning and Schütze (1999). The simplest approach uses frequency of occurrence as a measure of unithood (e.g., Dagan and Church 1994; Daille et al. 1994; Justeson and Katz 1995). More advanced approaches use measures based on, for example, (pointwise) mutual information (e.g., Church and Hanks 1990; Damerau 1993; Daille et al. 1994) or a likelihood ratio (e.g., Dunning 1993; Daille et al. 1994). Another statistical approach to measure unithood is the C-value (Frantzi et al. 2000). The NC-value (Frantzi et al. 2000) and the SNC-value (Maynard and Ananiadou 2000) are extensions of the C-value that measure not only unithood but also termhood. Other statistical approaches to measure termhood can be found in the work of, for example, Drouin (2003) and Matsuo and Ishizuka (2004). In the field of machine learning, an interesting statistical approach to measure both unithood and termhood is proposed by Wang et al. (2007).

Termhood is measured as the degree to which the occurrences of a semantic unit are biased towards one or more topics.

In the first step of our methodology, we use a linguistic filter to identify noun phrases. We first assign to each word occurrence in the corpus a part-of-speech tag, such as noun, verb, or adjective. The appropriate part-of-speech tag for a word occurrence is determined using a part-of-speech tagger developed by Schmid (1994, 1995). We use this tagger because it has a good performance and because it is freely available for research purposes.
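
As an illustration of what such a linguistic filter might look like, here is a minimal sketch. It assumes the text has already been tagged and that the filter keeps maximal runs of adjectives and nouns that end in a noun. The simplified tag names (`ADJ`, `NOUN`), the pattern, and the function name `noun_phrases` are assumptions for illustration; they are not the tag set of the Schmid tagger or the exact pattern used in the paper.

```python
def noun_phrases(tagged, min_len=2):
    """Return maximal runs of ADJ/NOUN tokens that end in a NOUN.

    `tagged` is a list of (word, tag) pairs with hypothetical simplified tags.
    """
    phrases, run = [], []
    # A sentinel token with a non-matching tag flushes the final run.
    for word, tag in tagged + [("", "END")]:
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # Drop trailing adjectives so that the phrase ends with a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if len(run) >= min_len:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


tagged = [
    ("we", "PRON"), ("study", "VERB"),
    ("stochastic", "ADJ"), ("inventory", "NOUN"), ("control", "NOUN"),
    ("under", "ADP"), ("uncertain", "ADJ"), ("demand", "NOUN"),
]
print(noun_phrases(tagged))
```

The `min_len` parameter is only a convenience of this sketch; setting it to 1 would also keep single nouns as candidates.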

The most common approach to measure unithood is to determine whether a phrase occurs more frequently than would be expected based on the frequency of occurrence of the individual words within the phrase.

To measure the unithood of a noun phrase, we first count the number of occurrences of the phrase, the number of occurrences of the phrase without the first word, and the number of occurrences of the first word of the phrase. In a similar way as Dunning (1993), we then use a so-called likelihood ratio to compare the first number with the last two numbers.
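
The excerpt does not give the formula, so the following is only a sketch of one common way to compute such a likelihood ratio: the binomial formulation of Dunning's statistic for two adjacent units, with the first word as one unit and the rest of the phrase as the other. The function names and the `n_total` parameter (the total number of positions in the corpus) are assumptions of this sketch.

```python
import math


def _log_binom(k, n, x):
    """log of x**k * (1 - x)**(n - k), taking 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(x)
    if n - k > 0:
        total += (n - k) * math.log(1 - x)
    return total


def unithood_llr(c_phrase, c_first, c_rest, n_total):
    """Return -2 log(lambda) for 'first word' followed by 'rest of phrase'.

    c_phrase: occurrences of the whole phrase
    c_first:  occurrences of the first word
    c_rest:   occurrences of the phrase without its first word
    n_total:  total number of positions in the corpus (assumed input)
    """
    p = c_rest / n_total                             # rest, independent of first word
    p1 = c_phrase / c_first                          # rest, given the first word
    p2 = (c_rest - c_phrase) / (n_total - c_first)   # rest, given another word
    log_lambda = (
        _log_binom(c_phrase, c_first, p)
        + _log_binom(c_rest - c_phrase, n_total - c_first, p)
        - _log_binom(c_phrase, c_first, p1)
        - _log_binom(c_rest - c_phrase, n_total - c_first, p2)
    )
    return -2.0 * log_lambda


# A phrase whose parts almost always occur together scores higher
# than one whose parts mostly occur separately.
print(unithood_llr(8, 10, 12, 1000), unithood_llr(2, 10, 12, 1000))
```

When the phrase occurs exactly as often as independence would predict, the statistic is zero; larger values indicate stronger evidence that the words form a unit.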

The main idea of the third step of our methodology is to measure the termhood of a semantic unit as the degree to which the occurrences of the unit are biased towards one or more topics.

To measure the degree to which the occurrences of semantic unit uk, where k ∈ {1,…,K}, are biased towards one or more topics, we use two probability distributions, namely the distribution of semantic unit uk over the set of all topics and the distribution of all semantic units together over the set of all topics. These distributions are denoted by, respectively, P(tj | uk) and P(tj), where j ∈ {1,…,J}. ... The dissimilarity between the two distributions indicates the degree to which the occurrences of uk are biased towards one or more topics. We use the dissimilarity between the two distributions to measure the termhood of uk.

For example, if the two distributions are identical, the occurrences of uk are unbiased and uk most probably does not represent a domain-specific concept. If, on the other hand, the two distributions are very dissimilar, the occurrences of uk are strongly biased and uk is very likely to represent a domain-specific concept.

The dissimilarity between two probability distributions can be measured in many different ways. One may use, for example, the Kullback–Leibler divergence, the Jensen–Shannon divergence, or a chi-square value.
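
For reference, two of the measures mentioned above can be sketched in a few lines. This uses the standard definitions of the Kullback–Leibler and Jensen–Shannon divergences with natural logarithms; the function names are mine, and the distributions are plain lists of probabilities over the same topics.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


overall = [0.5, 0.3, 0.2]    # P(t_j): overall topic distribution
focused = [0.9, 0.05, 0.05]  # P(t_j | u_k) for a topic-specific unit
print(kl_divergence(focused, overall), js_divergence(focused, overall))
```

Note that the Kullback–Leibler divergence requires q to be nonzero wherever p is nonzero, whereas the Jensen–Shannon divergence is always defined.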

In (3), termhood (uk) is calculated as the negative entropy of this distribution. Notice that termhood (uk) is maximal if P(tj | uk) = 1 for some j and that it is minimal if P(tj | uk) = P(tj) for all j. In other words, termhood (uk) is maximal if the occurrences of uk are completely biased towards a single topic, and termhood (uk) is minimal if the occurrences of uk do not have a bias towards any topic.
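
The excerpt does not reproduce equation (3) or the distribution it refers to. The sketch below therefore rests on an assumption: that "this distribution" is obtained by dividing P(tj | uk) by P(tj) for each topic and normalizing the ratios so that they sum to one. Under that assumption, the negative entropy behaves as described above: it reaches its maximum of 0 when uk belongs to a single topic, and its minimum of -log J when P(tj | uk) = P(tj) for all j.

```python
import math


def termhood(p_t_given_u, p_t):
    """Negative entropy of the normalized ratios P(t_j | u_k) / P(t_j).

    The normalized-ratio distribution is an assumption of this sketch;
    it is not stated in the excerpt above.
    """
    ratios = [a / b for a, b in zip(p_t_given_u, p_t)]
    total = sum(ratios)
    p = [r / total for r in ratios]
    # Terms with p_j = 0 contribute 0 (0 * log 0 is taken as 0).
    return sum(x * math.log(x) for x in p if x > 0)


p_t = [0.5, 0.3, 0.2]
print(termhood([1.0, 0.0, 0.0], p_t))  # unit tied to a single topic
print(termhood([0.5, 0.3, 0.2], p_t))  # unit distributed like the corpus as a whole
```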

In order to allow for a many-to-many relationship between corpus segments and topics, we make use of probabilistic latent semantic analysis (PLSA) (Hofmann 2001).

It was originally introduced as a probabilistic model that relates occurrences of words in documents to so-called latent classes. In the present context, we are dealing with semantic units and corpus segments instead of words and documents, and we interpret the latent classes as topics.

PLSA assumes that each occurrence of a semantic unit in a corpus segment is independently generated according to the following probabilistic process. First, a topic t is drawn from a probability distribution P(tj), where j ∈ {1,…,J}. Next, given t, a corpus segment s and a semantic unit u are independently drawn from, respectively, the conditional probability distributions P(si | t), where i ∈ {1,…,I}, and P(uk | t), where k ∈ {1,…,K}. This then results in the occurrence of u in s.

$$P(s_i, u_k) = \sum_{j=1}^{J} P(t_j)\, P(s_i \mid t_j)\, P(u_k \mid t_j)$$

We estimate these parameters using data from the corpus. Estimation is based on the criterion of maximum likelihood. The log-likelihood function to be maximized is given by
$$L = \sum_{i=1}^{I} \sum_{k=1}^{K} n_{ik} \log P(s_i, u_k)$$
We use the EM algorithm discussed by Hofmann (1999, Sect. 3.2) to perform the maximization of this function.
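
To make the estimation step concrete, here is a minimal sketch of plain EM for this symmetric PLSA model, written with the standard library only. It is not the authors' implementation: it assumes random initialization, a fixed number of iterations, and no tempering or other refinements. The names `plsa`, `log_likelihood`, and the matrix `n` (with `n[i][k]` playing the role of nik) are illustrative.

```python
import math
import random


def _normalized(v):
    s = sum(v)
    # Guard for the degenerate case where a topic receives no mass.
    return [x / s for x in v] if s > 0 else [1.0 / len(v)] * len(v)


def plsa(n, J, iters=50, seed=0):
    """Estimate P(t), P(s|t), P(u|t) by EM; n[i][k] counts unit k in segment i."""
    rng = random.Random(seed)
    I, K = len(n), len(n[0])
    p_t = [1.0 / J] * J
    p_s_t = [_normalized([rng.random() + 0.1 for _ in range(I)]) for _ in range(J)]
    p_u_t = [_normalized([rng.random() + 0.1 for _ in range(K)]) for _ in range(J)]
    for _ in range(iters):
        new_t = [0.0] * J
        new_s = [[0.0] * I for _ in range(J)]
        new_u = [[0.0] * K for _ in range(J)]
        for i in range(I):
            for k in range(K):
                if n[i][k] == 0:
                    continue
                # E-step: posterior P(t_j | s_i, u_k) for this pair.
                post = _normalized(
                    [p_t[j] * p_s_t[j][i] * p_u_t[j][k] for j in range(J)]
                )
                # Accumulate expected counts for the M-step.
                for j in range(J):
                    w = n[i][k] * post[j]
                    new_t[j] += w
                    new_s[j][i] += w
                    new_u[j][k] += w
        # M-step: renormalize the expected counts.
        p_t = _normalized(new_t)
        p_s_t = [_normalized(row) for row in new_s]
        p_u_t = [_normalized(row) for row in new_u]
    return p_t, p_s_t, p_u_t


def log_likelihood(n, p_t, p_s_t, p_u_t):
    """L = sum_i sum_k n_ik log P(s_i, u_k) under the fitted model."""
    total = 0.0
    for i, row in enumerate(n):
        for k, count in enumerate(row):
            if count:
                joint = sum(p_t[j] * p_s_t[j][i] * p_u_t[j][k] for j in range(len(p_t)))
                total += count * math.log(joint)
    return total


# Toy data: two segments use units 0 and 1, two segments use units 2 and 3.
n = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
p_t, p_s_t, p_u_t = plsa(n, J=2)
print(p_t, log_likelihood(n, p_t, p_s_t, p_u_t))
```

Because EM only finds a local maximum of the log-likelihood, different seeds can give different topics; in practice one would compare several runs.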

After estimating the parameters of PLSA, we apply Bayes’ theorem to obtain a probability distribution over the topics conditional on a semantic unit. This distribution is given by

$$P(t_j \mid u_k) = \frac{P(t_j)\, P(u_k \mid t_j)}{\sum_{j'=1}^{J} P(t_{j'})\, P(u_k \mid t_{j'})}$$
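
As a small worked example of this application of Bayes' theorem, the sketch below turns estimated values of P(tj) and P(uk | tj) for one semantic unit into the posterior over topics. The function name and the numbers are made up for illustration.

```python
def topic_posterior(p_t, p_u_given_t):
    """P(t_j | u_k) for one unit, from P(t_j) and P(u_k | t_j) for all j."""
    joint = [pt * pu for pt, pu in zip(p_t, p_u_given_t)]
    evidence = sum(joint)  # the denominator: sum over all topics
    return [x / evidence for x in joint]


p_t = [0.5, 0.5]
# A unit that only ever occurs under the first topic.
print(topic_posterior(p_t, [0.2, 0.0]))
# A unit that is equally likely under both topics.
print(topic_posterior(p_t, [0.1, 0.1]))
```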

In a similar way as discussed earlier, we use the dissimilarity between the distributions P(tj | uk) and P(tj) to measure the termhood of uk.

We first selected a number of OR journals. This was done based on the subject categories of Thomson Reuters. The OR field is covered by the category Operations Research & Management Science. Since we wanted to focus on the core of the field, we selected only a subset of the journals in this category. More specifically, a journal was selected if it belongs to the category Operations Research & Management Science and possibly also to the closely related category Management and if it does not belong to any other category. This yielded 15 journals, which are listed in the first column of Table 1.

In the first step of our methodology, the linguistic filter identified 2662 different noun phrases. In the second step, the unithood of these noun phrases was measured. 203 noun phrases turned out to have a rather low unithood and therefore could not be regarded as semantic units. ... The other 2459 noun phrases had a sufficiently high unithood to be regarded as semantic units.

In the third and final step of our methodology, the termhood of these semantic units was measured. To do so, each title-abstract pair in the corpus was treated as a separate corpus segment. For each combination of a semantic unit uk and a corpus segment si, it was determined whether uk occurs in si (nik = 1) or not (nik = 0). Topics were identified using PLSA. This required the choice of the number of topics J. Results for various numbers of topics were examined and compared. Based on our own knowledge of the OR field, we decided to work with J = 10 topics.
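
A rough sketch of how the binary values nik could be built from title-abstract pairs is shown below. It assumes case-insensitive substring matching, which is a simplification: the excerpt does not say how occurrences were detected, and with plain substring matching a unit such as "inventory" also matches inside "inventory cost". The function name and sample texts are illustrative.

```python
def occurrence_matrix(segments, units):
    """n[i][k] = 1 if unit k occurs in segment i, else 0 (naive substring match)."""
    lowered = [segment.lower() for segment in segments]
    return [[1 if unit in text else 0 for unit in units] for text in lowered]


segments = [
    "Inventory control under uncertain demand",
    "A game theory model of supplier competition",
]
units = ["inventory control", "game theory", "demand"]
print(occurrence_matrix(segments, units))
```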

The evaluation of a methodology for automatic term identification is a difficult issue. There is no generally accepted standard for how evaluation should be done. We refer to Pazienza et al. (2005) for a discussion of the various problems.

We first perform an evaluation based on the well-known notions of precision and recall. We then perform a second evaluation by constructing a term map and asking experts to assess the quality of this map.

Precision is the number of correctly identified terms divided by the total number of identified terms.

Recall is the number of correctly identified terms divided by the total number of correct terms.

Unfortunately, because the total number of correct terms in the OR field is unknown, we could not calculate the true recall. This is a well-known problem in the context of automatic term identification (Pazienza et al. 2005).

To circumvent this problem, we defined recall in a slightly different way, namely as the number of correctly identified terms divided by the total number of correct terms within the set of all semantic units identified in the second step of our methodology. Recall calculated according to this definition provides an upper bound on the true recall. However, even using this definition of recall, the calculation of precision and recall remained problematic. The problem was that it is very time-consuming to manually determine which of the 2459 semantic units identified in the second step of our methodology are correct terms and which are not. We solved this problem by estimating precision and recall based on a random sample of 250 semantic units.
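
The sample-based estimate can be illustrated as follows. The sketch assumes each sampled semantic unit has a termhood score and a manual label saying whether it is a correct term, and that the method selects the n highest-scoring units. Recall is computed relative to the correct terms in the sample, following the modified definition above. The scores, labels, and function name are invented for illustration.

```python
def precision_recall_at(scores, labels, n):
    """Precision and recall when the n highest-scoring units are selected.

    scores: {unit: termhood score}; labels: {unit: True if a correct term}.
    Recall is relative to the correct terms within the labelled sample.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    selected = ranked[:n]
    hits = sum(1 for unit in selected if labels[unit])
    total_correct = sum(1 for is_term in labels.values() if is_term)
    return hits / n, hits / total_correct


scores = {
    "integer programming": 0.0,
    "inventory control": -0.1,
    "queueing model": -0.3,
    "paper": -1.2,
    "result": -1.5,
}
labels = {
    "integer programming": True,
    "inventory control": True,
    "queueing model": True,
    "paper": False,
    "result": False,
}
for n in (2, 4):
    print(n, precision_recall_at(scores, labels, n))
```

Varying n from 1 to the sample size traces out a precision/recall curve of the kind compared in the evaluation.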

It is clear from the figure that our methodology outperforms the two simple alternatives. Except for very low and very high levels of recall, our methodology always has a considerably higher precision than the variant of our methodology that does not make use of PLSA.

A term map is a map, usually in two dimensions, that shows the relations between important terms in a scientific field. Terms are located in a term map in such a way that the proximity of two terms reflects their relatedness as closely as possible. That is, the smaller the distance between two terms, the stronger their relation. The aim of a term map usually is to visualize the structure of a scientific field.

It turned out that, out of the 2459 semantic units identified in the second step of our methodology, 831 had the highest possible termhood value. This means that, according to our methodology, 831 semantic units are associated exclusively with a single topic within the OR field. We decided to select these 831 semantic units as the terms to be included in the term map. This yielded a coverage of 97.0%, which means that 97.0% of the title-abstract pairs in the corpus contain at least one of the 831 terms to be included in the term map.
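
Coverage as defined here is straightforward to compute from a binary occurrence matrix. The sketch below assumes a matrix `n` with one row per title-abstract pair and one column per semantic unit, plus a list of the column indices of the selected terms; both names are illustrative.

```python
def coverage(n, selected):
    """Fraction of segments that contain at least one of the selected terms."""
    covered = sum(1 for row in n if any(row[k] for k in selected))
    return covered / len(n)


n = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
print(coverage(n, selected=[0, 1]))
```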

The term map of the OR field was constructed using a procedure similar to the one used in our earlier work (Van Eck and Waltman 2007b). This procedure relies on the association strength measure (Van Eck and Waltman 2009) to determine the relatedness of two terms, and it uses the VOS technique (Van Eck and Waltman 2007a) to determine the locations of terms in the map.
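
The excerpt does not spell out the association strength formula. As a hedged sketch, the code below uses the form in which the number of co-occurrences of two terms is divided by the product of their individual occurrence counts, which is my understanding of the measure up to a constant factor. It computes the values from a binary occurrence matrix and does not attempt the VOS layout itself. All names are illustrative.

```python
from itertools import combinations


def association_strength(n):
    """Map each term pair (a, b) to c_ab / (c_a * c_b).

    n[i][k] = 1 if term k occurs in segment i. The formula is an assumed
    form of the association strength, defined up to a constant factor.
    """
    num_terms = len(n[0])
    occurrences = [sum(row[k] for row in n) for k in range(num_terms)]
    strength = {}
    for a, b in combinations(range(num_terms), 2):
        if occurrences[a] == 0 or occurrences[b] == 0:
            continue  # a term that never occurs has no defined strength
        co_occurrences = sum(row[a] * row[b] for row in n)
        strength[(a, b)] = co_occurrences / (occurrences[a] * occurrences[b])
    return strength


n = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 1],
]
print(association_strength(n))
```

Dividing by the individual counts corrects for the fact that frequent terms co-occur with everything more often, so that a high value reflects a relation stronger than frequency alone would produce.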

The most serious criticism of the results of the automatic term identification concerned the presence of a number of rather general terms in the map.

Another point of criticism concerned the underrepresentation of certain topics in the term map. There were three experts who raised this issue. One expert felt that the topic of supply chain management is underrepresented in the map. Another expert stated that he had expected the topic of transportation to be more visible. The third expert believed that the topics of combinatorial optimization, revenue management, and transportation are underrepresented.

As discussed earlier, when we were putting together the corpus, we wanted to focus on the core of the OR field and we therefore only included documents from a relatively small number of journals. This may for example explain why the topic of transportation is not clearly visible in the map.

When asked to divide the OR field into a number of smaller subfields, most experts indicated that there are two natural ways to make such a division. On the one hand, a division can be made based on the methodology that is being used, such as decision theory, game theory, mathematical programming, or stochastic modeling. On the other hand, a division can be made based on the area of application, such as inventory control, production planning, supply chain management, or transportation. There were two experts who noted that the term map seems to mix up both divisions of the OR field. According to these experts, one part of the map is based on the methodology-oriented division of the field, while the other part is based on the application-oriented division.

The experts pointed out that sometimes closely related terms are not located very close to each other in the map. One of the experts gave the terms inventory and inventory cost as an example of this problem. In many cases, a problem such as this is probably caused by the limited size of the corpus that was used to construct the map. In other cases, the problem may be due to the inherent limitations of a two-dimensional representation.

Our main contribution consists of a methodology for automatic identification of terms in a corpus of documents. Using this methodology, the process of selecting the terms to be included in a term map can be automated for a large part, thereby making the process less labor-intensive and less dependent on expert judgment. Because less expert judgment is required, the process of term selection also involves less subjectivity.

In general, we are quite satisfied with the results that we have obtained. The precision/recall results clearly indicate that our methodology outperformed two simple alternatives. In addition, the quality of the term map of the OR field constructed using our methodology was assessed quite positively by five experts in the field. However, the term map also revealed a shortcoming of our methodology, namely the incorrect identification of a number of general noun phrases as terms.

As scientific fields tend to overlap more and more and disciplinary boundaries become more and more blurred, finding an expert who has a good overview of an entire domain becomes more and more difficult. This poses serious difficulties for any bibliometric method that relies on expert knowledge.
