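Science mapping is one of the useful tools for visualizing scientific structure, helping users identify scientific themes and thereby discover new knowledge. Previous studies have used authors, articles, and journals as units of mapping. For computing the relatedness between mapping units, Börner, Chen, and Boyack (2005) divided relatedness measures into two broad categories, citation linkages and co-occurrence similarities, whereas this study groups the methods currently in common use for assessing relatedness between authors into five: direct citation, cocitation analysis, co-authorship analysis, bibliographic coupling analysis, and co-word analysis. Other studies have developed hybrid measures that integrate textual content and links to compute the relatedness between articles (Ahlgren & Colliander, 2009; Boyack & Klavans, 2010; Cao & Gao, 2005) and between journals (Liu et al., 2010).
This study proposes two word-based methods that use vector space modeling, and a topic model method based on LDA (latent Dirichlet allocation), for measuring the relatedness between authors. The first method is called the static approach. It builds a feature vector for each author from the content of the papers the author has written, so that the author's feature vector is the sum of the feature vectors of all of his or her papers; the relatedness between any two authors is the cosine of the angle between their author feature vectors. The second method is the dynamic approach. If two authors have no co-authored papers, their relatedness is still the cosine of the angle between their author feature vectors. If they have co-authored, the feature vectors of their co-authored papers are first excluded from their author feature vectors, and the cosine is then computed. The author feature vector used for a given author may therefore change depending on which other author it is compared with, hence the name dynamic.
The basic topic model assumes that each paper is a mixture of topics and that each topic is a mixture of terms. The topic mixture of each paper is generated from a Dirichlet distribution with a known parameter α; the term mixture of each topic is generated from another Dirichlet distribution with a known parameter β. To generate a paper d, its topic mixture θd is first sampled from Dir(α); each word in the paper is then generated from a topic z sampled from θd and that topic's term mixture ϕk. This study adopts the author-topic model of Rosen-Zvi, Chemudugunta, Griffiths, Smyth, and Steyvers (2010), an LDA model extended with author information. The model assumes that each author is a topic mixture generated from a Dirichlet distribution with a known parameter α. Let ad be the set of authors of a paper. To generate each word of the paper, an author x is first drawn at random from ad together with his or her topic mixture θx, and the topic z is then sampled from θx. This study uses Gibbs sampling (Griffiths & Steyvers, 2004) for inference in the author-topic model, which yields the distribution of each topic over terms and the distribution of each author over topics. The author-topic model can therefore measure the relatedness between authors from the similarity of their topic distributions.
The data for this study comprise 5,227 bibliographic records from eight major library and information science journals published between 2000 and 2010; from the 6,282 distinct authors in these records, the 50 most prolific were selected. Four methods (static features, dynamic features, the topic model, and cocitation analysis) were used to measure the relatedness between the prolific authors, and the results were visualized with MDS (multidimensional scaling) and hierarchical cluster analysis. For the topic model, α was set to 50/K, where K is the number of topics and was set to 20, β was set to 0.01, and the number of Gibbs sampling iterations was set to 1,000. A correlation analysis of the values that the four measures assign to each pair of authors found the highest correlation between the static and dynamic features; the correlations between the topic model and the other two content-based measures were also higher than those with the cocitation method. All four measures reveal the two main axes of the LIS field: one is information retrieval and web studies, and the other is research on metrics for scientific evaluation. The LDA model gives the most coherent results in the hierarchical cluster analysis. In addition, the results of the content-based methods are easier to interpret than those of the citation-based method.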
In this study we present static and dynamic word-based approaches using vector space modeling, as well as a topic-based approach based on latent Dirichlet allocation for mapping author research relatedness.
Outcomes for the two word-based approaches and a topic-based approach for 50 prolific authors in library and information science are compared with more traditional author cocitation analysis using multidimensional scaling and hierarchical cluster analysis.
Science mapping is one of the most useful tools to visualize scientific structure. It helps to identify scientific themes and to discover new knowledge.
The unit of interest for mapping may include authors, articles, and journals.
To date, five approaches have been used to measure the relatedness between authors, where the nature of the relationship studied is based on the data used: direct citation, cocitation analysis, co-authorship analysis, bibliographic coupling analysis, and co-word analysis.
Recently, more sophisticated hybrid methods (i.e., using textual content and citations) have been applied to the mapping of articles (Ahlgren & Colliander, 2009; Boyack & Klavans, 2010; Cao & Gao, 2005) and journals (Liu et al., 2010).
As an initial investigation of these topics, our focus will be on authors whose publications appear in the highest impact library and information science journals.
In reviewing visualization studies for knowledge domains, Börner, Chen, and Boyack (2005) categorized relatedness measures into two broad categories: citation linkages and co-occurrence similarities. Within the relatedness measures, five basic approaches were identified: direct citation, cocitation analysis, co-authorship analysis, bibliographic coupling, and co-word analysis.
Direct citation accounts for the relatedness between a citing work and a cited work based on citing behavior. ... Shibata, Kajikawa, Takeda, and Matsushima (2008) explored citation networks for two research domains and divided the networks into clusters in order to identify research fronts. Direct citation has not attracted wide attention. One possible reason may be its requirement for a very long time window to obtain a sufficient linking signal for clustering (Boyack & Klavans, 2010).
The idea that two articles that share the same references are related, referred to as bibliographic coupling, was outlined by Kessler (1963). The more references two articles have in common, the more closely related they are thought to be. Note that this list is static over time because references within articles do not change. With the interrelation of this link, scientific products can be ordered into groups. Weinberg (1974) reviewed the theory and practical applications of bibliographic coupling and granted the usefulness of the method. More recently, Zhao and Strotmann (2008) aggregated bibliographic coupling at an author’s oeuvre (body of work) level, which they called author bibliographic-coupling analysis (ABCA). They found ABCA can provide an effective picture of current active research in a field.
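The counting behind bibliographic coupling can be sketched in a few lines. This is a minimal illustration, not code from the study: the function name `coupling_strength` and the toy reference lists are hypothetical, and coupling strength is taken simply as the number of shared references.

```python
from itertools import combinations

def coupling_strength(references):
    """Return {(doc_a, doc_b): shared reference count} for every coupled pair."""
    strengths = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared:  # keep only pairs that share at least one reference
            strengths[(a, b)] = shared
    return strengths

# Hypothetical reference lists, keyed by citing document.
references = {
    "doc1": {"r1", "r2", "r3"},
    "doc2": {"r2", "r3", "r4"},
    "doc3": {"r5"},
}
coupling = coupling_strength(references)
```

Because the reference sets never change after publication, rerunning this on the same documents always gives the same strengths, which is the static property noted above.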
Cocitation analysis, introduced by Small (1973), is probably the most influential approach for assessing relatedness measures. If two articles are cited by the same third article, these two articles are co-cited. The assumption is that the appearance of two articles in the same reference list indicates a semantic association between the articles. Unlike traditional bibliographic coupling, cocitation is a dynamic relationship based on the citing authors. New citing authors can change the cocitation relationship. This feature is important because science is developing continuously. Relationships among scientific units being studied should be able to incorporate this dynamic change.
White and Griffith (1981) first applied cocitation techniques to authors, called author cocitation analysis or ACA. The essential transformation is to consider “Author” as a body of writings by a person (i.e., an oeuvre). So the cocitation of authors applies to any work by any author being co-cited with any work by another author.
Since then, a number of studies have been conducted using variations of the ACA method, including normalization (Ahlgren, Jarneving, & Rousseau, 2003; Leydesdorff & Vaughan, 2006; White, 2003; van Eck & Waltman, 2009), author counts (Zhao & Strotmann, 2011), and last-author ACA (Zhao & Strotmann, 2010).
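A small sketch may help make the author-level counting concrete. It rests on assumptions that are not from the study: each cited work is mapped to a single author (the hypothetical `work_author` table), and a citing paper contributes at most one count to each pair of cited authors. The ACA variants cited above differ on exactly these choices.

```python
from collections import Counter
from itertools import combinations

def author_cocitation_counts(reference_lists, work_author):
    """Count, for each author pair, the citing papers that cite both oeuvres."""
    counts = Counter()
    for cited_works in reference_lists:
        # Collapse the cited works to the set of cited authors (their oeuvres).
        authors = {work_author[w] for w in cited_works if w in work_author}
        for pair in combinations(sorted(authors), 2):
            counts[pair] += 1
    return counts

# Hypothetical data: cited work -> author, and one reference list per citing paper.
work_author = {"w1": "Author_A", "w2": "Author_A", "w3": "Author_B", "w4": "Author_C"}
reference_lists = [
    ["w1", "w3"],        # co-cites A and B
    ["w2", "w3", "w4"],  # co-cites A-B, A-C, and B-C
    ["w1", "w2"],        # cites only A, so no pair
]
cocitations = author_cocitation_counts(reference_lists, work_author)
```

Any work by Author_A appearing alongside any work by Author_B counts toward the pair, which mirrors the oeuvre-level definition; adding a new citing paper to `reference_lists` can change the counts, which is the dynamic property of cocitation.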
One disadvantage of cocitation analysis is the lack of cognitive interpretation of the relatedness of the co-cited units. Without enough domain knowledge, one can hardly interpret the cocitation map.
Leydesdorff (1987) argued that cocitation maps only partially represent the structure of science.
A co-authorship relationship is established when authors co-publish a paper. Glänzel (2001) studied international co-authorship links to reveal the structures in international collaborations. Liu, Bollen, Nelson, and Van de Sompel (2005) constructed a network with co-authorship relations in the field of digital libraries. Ding (2011b) studied scientific collaborations and citation patterns of researchers and combined the results with a topic model approach to examine collaborations among researchers who share similar and different research interests.
It is this feature of co-authorship that makes co-authorship analysis more revealing of a social network rather than a scientific structure.
Co-word analysis collects evidence of relatedness from co-occurring keywords from different articles. Compared with the approaches introduced earlier, co-word analysis directly uses actual contents to measure relatedness, whereas the others find indirect evidence through citation and co-author relations. An obvious advantage of co-word analysis is that relatedness can be interpreted directly according to document contents.
Coulter, Monarch, and Konda (1998) mapped the discipline of software engineering with co-word analysis. Indexing terms from the ACM Computing Classification System were used as the unit of analysis. Ding, Chowdhury, and Foo (2001) conducted a co-word analysis on a sample of 2,012 articles from the Web of Science (WoS) to reveal themes of information retrieval research.
Leydesdorff (1997) noted that the meanings of words change from position to position and from one text to another. He also suggested this change will destabilize the science map produced by co-word analysis.
Another disadvantage of using indexer-assigned keywords as the source for co-word analysis is the “indexer effect” (Law & Whittaker, 1992), which creates bias through factors such as the artificiality of an indexing language, delays in changes to the indexing language to reflect the current state of a discipline, and subjectivity in the assignment of index terms.
In the vector space, a number of documents constitute a document space. The centroid of the document space is a summarization of the characteristics of the space. It represents the average vector for a group of documents.
Each author will be viewed as a document space consisting of the articles he/she has written. This space is a subspace of the collection space, named the author space. The centroid of the author space will be used to represent the author. The relatedness between authors will be measured through the similarity between the centroids of their author spaces.
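The centroid representation can be sketched as follows. This is an illustrative sketch under stated assumptions: documents are sparse term-weight dictionaries, the weights are made-up numbers rather than real TF*IDF values, and the helper names `centroid` and `cosine` are hypothetical.

```python
import math

def centroid(vectors):
    """Average a list of sparse term-weight vectors (dict: term -> weight)."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical author spaces: one term-weight vector per article.
author_a_space = [{"citation": 2.0, "index": 1.0}, {"citation": 1.0}]
author_b_space = [{"citation": 1.0, "retrieval": 1.0}]

similarity = cosine(centroid(author_a_space), centroid(author_b_space))
```

Since the cosine ignores vector length, using the sum of an author's article vectors instead of their average gives the same relatedness value.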
The topic model is an improvement over the basic vector space model in terms of relieving the independence assumption and capturing the term associations. Instead of assuming independence among terms, the topic model assumes exchangeability among terms in documents, which is a much looser assumption.
Early works on the topic model include latent semantic indexing (LSI) by Deerwester et al. (1990) and the probabilistic LSI (pLSI) by Hofmann (1999). LDA is a more recent technique proposed by Blei, Ng, and Jordan (2003). It has an advantage over LSI in explicitly modeling the latent topics, and over pLSI in solving the overfitting problem (i.e., a model with too many parameters).
The LDA model treats a document as a mixture of topics and a topic as a mixture of terms. Each document (i.e., a mixture of topics θ) is generated from a latent Dirichlet distribution with a prior of α, and each topic (i.e., a mixture of terms ϕk) is generated from a Dirichlet distribution with a prior of β. The generation process entails, first, sampling a document θd from Dir(α). At each position of a word in a document, a topic z is selected according to θd, and a word w is selected according to z and ϕk.
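The generative process just described can be simulated directly. The sketch below is only an illustration: the vocabulary is a set of integer word ids, the sizes `K` and `V` and the values of `alpha` and `beta` are toy choices rather than the study's settings, and the helper names are hypothetical.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probabilities, rng):
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def generate_document(n_words, phi, alpha, rng):
    """Generate one document: sample theta_d, then a topic and a word per position."""
    theta_d = sample_dirichlet([alpha] * len(phi), rng)  # topic mixture of the document
    words = []
    for _ in range(n_words):
        z = sample_index(theta_d, rng)           # topic for this position
        words.append(sample_index(phi[z], rng))  # word id drawn from that topic
    return theta_d, words

rng = random.Random(0)
K, V = 3, 6              # toy numbers of topics and vocabulary terms
alpha, beta = 0.5, 0.5   # toy priors, not the values used in the study
phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]  # one term mixture per topic
theta_d, words = generate_document(10, phi, alpha, rng)
```

Inference runs this process in reverse: given only the observed words, it estimates the mixtures that most plausibly produced them.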
Rosen-Zvi, Chemudugunta, Griffiths, Smyth, and Steyvers (2010) extended the original LDA model to include authors and proposed the author-topic model (Figure 2). This model includes authorship information in the generative process. Each document has a number of authors ad. Each author is considered as a distribution of topics drawn from a Dirichlet distribution with a prior of α. For each word in a document, an author x is randomly drawn from ad and the topic distribution associated with this author is θx. Then a topic z is selected the same way as in an LDA model to generate the observed word w.
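The change relative to plain LDA is small enough to show in code: the topic mixture now belongs to an author rather than to the document. The following is a toy simulation, assuming that the author of each word is drawn uniformly from the document's author list; the author names, sizes, and prior values are hypothetical.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probabilities, rng):
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

def generate_words(doc_authors, theta, phi, n_words, rng):
    """For each position: pick an author x, a topic z from theta[x], a word w from phi[z]."""
    tokens = []
    for _ in range(n_words):
        x = rng.choice(doc_authors)      # author drawn uniformly from a_d (assumption)
        z = sample_index(theta[x], rng)  # topic drawn from that author's mixture
        w = sample_index(phi[z], rng)    # word id drawn from that topic's term mixture
        tokens.append((x, z, w))
    return tokens

rng = random.Random(1)
K, V = 3, 6              # toy numbers of topics and vocabulary terms
alpha, beta = 0.5, 0.5   # toy priors, not the values used in the study
authors = ["author_a", "author_b", "author_c"]
theta = {a: sample_dirichlet([alpha] * K, rng) for a in authors}  # one mixture per author
phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]       # one mixture per topic
tokens = generate_words(["author_a", "author_b"], theta, phi, 8, rng)
```

The per-author mixtures in `theta` are the quantities that later serve as author profiles when authors are compared.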
The advantage of this author-topic model is that it adds authorship information to the model, so that the topics are learned and assigned to documents accordingly. In the output of this model, each author is a distribution of different topics; each topic is a distribution of terms. As the purpose of the current study is to measure the relatedness of authors, the author-topic model will be appropriate to produce author similarities based on their topics.
Gibbs sampling (Griffiths & Steyvers, 2004) is used to estimate the parameters in the model.
Table 1 lists the eight journals selected for inclusion in the study. ... Bibliographic records for documents published in these journals between 2000 and 2010 were downloaded. Records downloaded were further limited to three document types: articles, proceedings papers, and reviews. ... In total, 5,227 records were downloaded from WoS. The raw WoS records were processed, and only three fields were kept: the article title (i.e., “TI” field), the Keywords Plus (i.e., “ID” field), and the abstract (i.e., “AB” field). The records then were indexed with the widely used Lemur information retrieval toolkit (http://www.lemurproject.org/). Stop words were removed and stemming was applied.
From the 5,227 records downloaded, we were able to identify 6,282 different author names using string matching. Because it is impractical to map all of the authors in our collection, we selected the 50 most prolific authors according to the WoS “analyze results” function. ... We selected the most prolific authors because the more an author writes, the better the algorithm used “understands” her/his interests, and thus the more accurate our assessment will be.
For each author in our author list we then generated an author space consisting of all the articles he/she wrote. TF*IDF term weighting was employed to assign term significance in the space. Terms that were single characters or only consisted of digits (e.g., “2001”) were filtered out. We believe that these terms add noise to the space rather than meaning. The relatedness between authors is measured through the cosine between the centroids of the author spaces.
One could argue that this creates a biased assessment of the strength of the relationship because there is an exact match for the text of the co-authored publications that creates a stronger bond than for two authors who have published in a common area but did not collaborate. On the other hand, the simple fact that the collaboration has resulted in one or more co-authored documents should be acknowledged as a strong tie between the authors.
In a static space, each author has her/his own space that consists of her/his articles. This space does not change when measuring author relatedness. ... The relatedness of authors will include the similarity arising from the strength of the co-authorships.
Conversely, in the dynamic author space, the author spaces depend on a pair of authors. Co-authored articles by the pair of authors are excluded. In this case, each author may have a different author space when measured with different authors.
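The difference between the two spaces can be sketched with a toy collection. Everything here is illustrative: the article records, term weights, and function names are hypothetical, and author vectors are built as sums of article vectors, which gives the same cosine as the centroid.

```python
import math

def summed_vector(articles):
    """Sum the sparse term-weight vectors of a list of articles."""
    total = {}
    for article in articles:
        for term, weight in article["terms"].items():
            total[term] = total.get(term, 0.0) + weight
    return total

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def static_relatedness(articles, a, b):
    """Each author space holds every article the author wrote."""
    space_a = [art for art in articles if a in art["authors"]]
    space_b = [art for art in articles if b in art["authors"]]
    return cosine(summed_vector(space_a), summed_vector(space_b))

def dynamic_relatedness(articles, a, b):
    """Articles co-authored by this pair are dropped from both spaces first."""
    space_a = [art for art in articles if a in art["authors"] and b not in art["authors"]]
    space_b = [art for art in articles if b in art["authors"] and a not in art["authors"]]
    return cosine(summed_vector(space_a), summed_vector(space_b))

# Hypothetical collection: one co-authored article and one solo article each.
articles = [
    {"authors": {"A", "B"}, "terms": {"citation": 1.0}},
    {"authors": {"A"}, "terms": {"retrieval": 1.0}},
    {"authors": {"B"}, "terms": {"retrieval": 1.0, "web": 1.0}},
]
static_value = static_relatedness(articles, "A", "B")
dynamic_value = dynamic_relatedness(articles, "A", "B")
```

In this toy case the static value is higher because the shared article contributes identical terms to both authors; if all of a pair's articles are co-authored, the dynamic spaces are empty and the sketch returns 0.0, a convention chosen here rather than taken from the study.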
The vector space model provides a number of readily available measures of relatedness. The most popular is the cosine measure, which measures the cosine of the angle formed by two vectors in the space. In effect, it compares the term weight distributions of the two vectors. The more similar the distributions are, the higher the cosine value is expected to be.
Gibbs sampling (Griffiths & Steyvers, 2004) was used to estimate the parameters in the author-topic model. We set the number of iterations to 1,000. The hyperparameter α was set to 50/K, where K is the number of topics, and the hyperparameter β was set to 0.01. We tested different values of K and decided to report the results from K = 20 (so that α = 2.5) because it produced the most reasonable outcome by our judgment.
The topic model toolbox was employed to perform the learning process (http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm).
An author-topic LDA model (Rosen-Zvi et al., 2010) was trained on our collection and a pair-wise cosine similarity measure comparison of the 50 authors was conducted, resulting in a symmetric matrix of similarity values based on the LDA modeling. Similarity matrices were also calculated for both the static and dynamic author spaces. Multidimensional scaling was used to visualize the relationships among the authors. ... Because the data represent a type of similarity measure, SPSS PROXSCAL was used to construct the map, as recommended by Leydesdorff and Vaughan (2006). To provide additional insights into the grouping of the authors, hierarchical cluster analysis (complete linkage method) was used in SPSS to superimpose groups of authors on the MDS maps to provide an additional means to assess the coherence in the resulting proximities between authors.
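The similarity matrix and the complete-linkage grouping can be sketched in plain Python. This is not the SPSS procedure used in the study: the four author-topic distributions are made-up numbers, distance is assumed to be 1 minus cosine similarity, and the MDS step is not reproduced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def similarity_matrix(rows):
    """Pairwise cosine similarities; the result is symmetric."""
    n = len(rows)
    return [[cosine(rows[i], rows[j]) for j in range(n)] for i in range(n)]

def complete_linkage(sim, n_clusters):
    """Merge clusters until n_clusters remain; cluster distance is the
    largest pairwise distance (1 - similarity) between their members."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dist = max(1.0 - sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical author-topic distributions (rows: authors, columns: topics).
author_topics = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
]
sim = similarity_matrix(author_topics)
groups = complete_linkage(sim, 2)
```

With these toy rows the two authors concentrated on the first topic end up in one group and the two concentrated on the third topic in the other. The same matrix could be passed to an MDS routine to obtain map coordinates.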
After tokenization of the field contents, 916,383 tokens, or individual words, were identified; the number of unique tokens, or distinct words, was 12,537. The average document length was 175.32 tokens.
An examination of the pair-wise correlation of these author relatedness measures reveals significant and moderate level correlations between the word-based, topic-based, and author cocitation measures (Table 5). It is not surprising that the static author map has a high correlation with the dynamic author map (Kendall’s tau b = 0.971). Similarly, the correlations among the three content-based approaches are generally higher than their correlations with the cocitation approach. This provides preliminary evidence that they measure different types of relationships.
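The comparison of two relatedness measures can be sketched as a rank correlation over author pairs. The code below is a hedged illustration using the standard tau-b definition with tie correction; the helper names and the toy values are hypothetical and are not the study's data. For 50 authors, each measure yields 50 × 49 / 2 = 1,225 pair values.

```python
import math
from itertools import combinations

def upper_triangle(matrix):
    """Flatten the values above the diagonal, one per unordered author pair."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty))."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue              # tied on both measures: ignored
        elif dx == 0:
            ties_x += 1           # tied only on x
        elif dy == 0:
            ties_y += 1           # tied only on y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x) *
                      (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical relatedness values for the same four author pairs under two measures.
measure_one = [0.9, 0.7, 0.5, 0.3]
measure_two = [0.8, 0.6, 0.7, 0.1]
tau = kendall_tau_b(measure_one, measure_two)
```

In practice `upper_triangle` would be applied to each 50 × 50 similarity matrix before the two resulting lists are passed to `kendall_tau_b`.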
In all cases, the largest single group consists of authors who work with different aspects of metrics-based studies, which is labeled as “Informetrics” in general in the two word-based maps and “Scientific impact evaluation” in the other two maps. This labeling indicates that the metrics-related topics have been a frequently investigated theme by the prolific authors in the selected journals during the first decade of the 21st century.
It is also noteworthy that the topic groupings of each of the maps largely align along the horizontal or vertical axis, with one side representing information retrieval (system and behavior) and web studies, and the other side corresponding to metrics-based or scientific evaluation studies.
As is shown from the maps, the static map (Figure 3) and dynamic map (Figure 4) are generally consistent in terms of the location of the authors, which indicates that the exclusion of similarities resulting from collaborations does not affect the overall layout. However, drastic changes may happen to individuals who have collaborated frequently with another author.
At the four-cluster agglomeration, the LDA map (Figure 6) provides the most coherent representation of the author map in relation to the generated clusters. At the two-cluster agglomeration, the clusters are neatly divided along the vertical axis, with metrics-related research represented on the left, and web and information retrieval-related themes on the right. Although the group membership of some individuals is still debatable, such as “Ingwersen_P” in the “Scientific impact evaluation” group given that he has also published in information retrieval and webometrics, the overall layout of the LDA map does provide semantically meaningful relationships.
Of the five author relatedness methods discussed earlier, only co-authorship provides a direct connection between authors.
Cocitations are contributed by third parties.
Direct citations reflect an author’s assessment of relatedness to a cited author or work but are still based on perception or the subjectivity inherent in citer motivation (Bornmann & Daniel, 2008).
This is also the case for bibliographic coupling, where the strength of the relationship is assessed by the overlap of references selected by two authors.
Co-word or topic-based studies can be argued to be the least influenced by citing behavior because they rely solely on the words developed by the authors themselves.
The newly proposed content-based approaches overcome several limitations of the more traditional cocitation approach.
In addition to avoiding citer subjectivity inherent in citation-based data, the links between authors will be more interpretable compared with the cocitation maps. The top terms/topics will be identifiable to help interpret the links between authors.
The content-based methods do not require an author to be cited in order to be included in the map. As long as the author has some publication record, her/his relatedness with other authors can be identified. This provides the opportunity for researchers who have not been widely cited to be included in the author map.
Furthermore, cocitation analysis outcomes may be affected by limited numbers of citations that do not reflect the true strength of the relationship between authors. This can be seen when comparing the cocitation outcomes with the topic-based outcomes, where several authors with low citation counts, and therefore low cocitation counts, end up at the periphery of the map. For the LDA outcome, these authors are more centrally situated among authors with similar topic areas.
The word-based and topic-based methods can be considered an extension of co-word analysis, where words are used to determine the relatedness of authors.
In Healey, Rothman, and Hoch (1986), a paradox is introduced: if a map represents a field that is already known to experts, then it is useless because it does not reveal anything new; if the map deviates from the expectation of the experts, then its outcome is questionable.
This initial investigation, which compares prolific authors from LIS, demonstrates: (1) the potential for more topically meaningful outcomes from the new methods when compared to more traditional cocitation analysis; (2) the topic-based method using LDA, for the data used in this study, produces more distinctive clusters and more reasonable results than the two word-based approaches.