Jeong, Y. K. & Song, M. (2016). Applying content-based similarity measure to
author co-citation analysis. In Proceedings of iConference 2016.
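This study uses the similarity of the content of the sentences in which citations appear to measure the topical relatedness of authors. The traditional approach to author co-citation analysis (ACA) takes the co-citation frequency of cited authors in reference lists (White and Griffith, 1981) and then measures author similarity with the Pearson correlation coefficient or Salton's cosine similarity. It has been widely used in bibliometric research to identify and trace the intellectual structure of an academic discipline (He & Hui, 2002). This approach, however, does not take the content of citations into account. Jeong, Song, & Ding (2014) and Zhao & Strotmann (2014) instead used the authors mentioned in the full text and incorporated the related content into the ACA computation.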
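The study holds that the accumulated sentences in which a cited author appears can represent that author's research areas. It therefore takes JASIST full-text data, parses the HTML, and extracts each article's metadata (title, author names, publication year, DOI, and abstract), citation information (citing sentences and reference index), and reference information (author names, publication year, title, and journal). The study uses 1,910 articles published from January 2003 to June 2015, with 77,408 references in total. Citing sentences are separated from general sentences, the reference indexes inside the sentences are linked to the reference information, and the 100 most-cited authors are selected for both traditional ACA and the new method proposed in the study. The new method uses the Word2Vec models proposed by Mikolov et al. (2013) to find similarities between authors from the citing sentences in which their references appear. Working from a large amount of text, Word2Vec models use neural network approaches to find semantic relationships between words, converting each word that appears in a sentence into a vector so that the similarities between vectors preserve the semantic relationships between words. The study treats cited author names as words appearing in citing sentences and measures both the similarity of authors' research topics and their collaborative relationships.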
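Table 2 lists the 10 most similar author pairs found by the traditional ACA method and by the study's method. Half of the 10 most similar pairs found by the study's method are authors with a collaborative relationship.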
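In addition, the author relationships produced by the two methods are each drawn as a network graph. Nodes represent authors, node size is determined by PageRank, the distance between nodes is determined by the similarity between authors, and the community detection method of Blondel, Guillaume, Lambiotte, & Lefebvre (2008) is used for clustering. Figures 3 and 4 show the results of traditional ACA and of the proposed method, respectively.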
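Figure 3 clearly shows the authors divided into two groups. According to the community detection results, the authors on the left can be further divided into two groups: the red group at the far left consists of authors studying information seeking behavior, and the purple group is related to information retrieval research. The green group on the right consists of authors studying bibliometrics. Two authors lie between the left and right groups: Borgman and Salton. Both have traditionally been cited frequently in information science.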
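In the author network produced by the Word2Vec method, the author groups related to information retrieval are on the left, comprising the information seeking behavior group at the top and the document retrieval group at the bottom. Bibliometrics is split into two related groups in Figure 4: one mainly covers author analysis, and the other covers journal citation analysis and evaluation indicators. Unlike Figure 3, the groups in Figure 4 are all connected to one another, and the graph presents sub-disciplines and important authors more specifically.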
Unlike other ACA studies, we used citing sentences to reflect the topical relatedness of authors.
In our research, we extended traditional approaches by adopting Word2Vec, one of the deep learning methods, to measure author similarity.
We also conducted an in-depth network analysis of the author maps.
The Word2Vec-based author map revealed more specific sub-disciplines and the important authors in terms of topical influence than the traditional approach did.
Author Co-citation Analysis (ACA), introduced by White and Griffith (1981), has been widely
used in bibliometric research to identify and trace the intellectual structure of an academic discipline
(He & Hui, 2002). Traditional ACA approaches rely on the co-citation frequency of cited authors in the
reference section.
Thus, one of the main topics in ACA has been the methodological discussion of which
measure is appropriate for calculating author similarities (Leydesdorff, 2005; van Eck &
Waltman, 2007). Existing measures based on co-citation frequencies, such as the Pearson correlation
coefficient and Salton's cosine similarity, however, do not capture the citation content.
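To make the traditional measures concrete, here is a minimal sketch of co-citation counting followed by Pearson and cosine similarity over co-citation profiles. The reference lists are hypothetical toy data, and excluding the two compared authors from each profile is one common convention, assumed here rather than taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

# Hypothetical toy data: each set holds the authors cited in one article's reference list.
reference_lists = [
    {"Salton", "Robertson", "Belkin"},
    {"Salton", "Robertson"},
    {"Garfield", "White", "Small"},
    {"Garfield", "White", "Salton"},
]
authors = sorted(set().union(*reference_lists))

# Co-citation frequency: number of articles whose references cite both authors.
cocitation = Counter()
for refs in reference_lists:
    for a, b in combinations(sorted(refs), 2):
        cocitation[(a, b)] += 1

def count(a, b):
    """Symmetric lookup of the co-citation count for a pair of authors."""
    return cocitation[tuple(sorted((a, b)))]

def cosine(x, y):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = sqrt(sum(p * p for p in x)) * sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0

def pearson(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([p - mx for p in x], [q - my for q in y])

def similarity(a, b, measure):
    """Compare the co-citation profiles of a and b over all other authors."""
    others = [o for o in authors if o not in (a, b)]
    return measure([count(a, o) for o in others], [count(b, o) for o in others])

print(similarity("Garfield", "White", pearson))
print(similarity("Salton", "Garfield", cosine))
```

Both measures see only counts, so two authors cited together for unrelated reasons look the same as two cited for the same topic, which is the limitation the passage describes.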
Thus, some recent
studies used the full text to obtain the topical relatedness between cited authors (Jeong, Song, &
Ding, 2014; Zhao & Strotmann, 2014). They analyzed the authors mentioned in the full text and
incorporated content related to the cited authors into ACA.
In that sense, the accumulated citing
sentences of cited authors can represent the cited research and the cited authors' research
areas well. In addition, these citing sentences are particularly useful for summarizing a research
document.
Figure 1 shows the overall system flow of our approach.
For content analysis, however, we collected full-text research articles of JASIST
in HTML format. Through the HTML parsing process, we extracted the metadata (title, author name, year,
DOI and abstract), citation information (citing sentence, and reference id), and reference information
(author name, year, title, and journal).
To compare our method with traditional ACA, we computed author pairs
with both approaches. In the Word2Vec-based method, the full-text data are first split into sentences.
In the second step, by matching the citing sentences with the reference ids in the reference section, we separated the
citing sentences from other general sentences. Then, the citing sentences are preprocessed in the following
steps: tokenization, POS tagging, lemmatization of the tokenized sentence, and stop word removal.
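The sentence-level steps above can be sketched roughly as follows. The `[refN]` marker format, the tiny stop word list, and the naive sentence splitter are all assumptions made for illustration. POS tagging and lemmatization are left out because they need an NLP library, and the paper does not say which one was used.

```python
import re

# Assumed citation marker format, e.g. "[ref12]"; the real JASIST HTML will differ.
REF_MARKER = re.compile(r"\[ref(\d+)\]")
# Tiny illustrative stop word list, not the one used in the paper.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "are", "to", "for", "by", "was"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def separate(sentences, reference_ids):
    """Split sentences into citing ones (containing a known reference id) and the rest."""
    citing, general = [], []
    for sentence in sentences:
        ids = {int(m) for m in REF_MARKER.findall(sentence)}
        (citing if ids & reference_ids else general).append(sentence)
    return citing, general

def preprocess(sentence):
    """Drop citation markers, lowercase, tokenize, and remove stop words."""
    text = REF_MARKER.sub(" ", sentence)
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

text = ("Co-citation analysis maps the structure of a field [ref1]. "
        "We collected full-text articles. "
        "PageRank was proposed for ranking web pages [ref2].")
citing, general = separate(split_sentences(text), reference_ids={1, 2})
tokens = [preprocess(s) for s in citing]
print(tokens)
```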
From
these data, we trained a Word2Vec model to calculate author similarity and generated an author-author
similarity matrix. For comparison with previous research, that is, the traditional author counting approach, we also
constructed a co-citation matrix based on citation counts. Since we preprocessed the full text including all
reference information, these matrices cover all cited authors.
For evaluation, we selected the top 100
authors who are highly cited under both methods and conducted network analysis by visualizing
author maps.
The data were gathered from 1,910 full-text articles in the JASIST digital library over 12 years (from
January 2003 to June 2015). The 1,910 collected documents contain 77,408 references. We extracted three
elements from each full-text article: 1) citing sentences from the body of the article, 2) the reference
information, and 3) all cited authors. Table 1 shows the basic statistics of the collected data.
Word2Vec models, one family of neural network approaches, carry semantic meaning and turn
text into a numerical form that deep-learning nets can understand (Mikolov et al., 2013). Based on a large
amount of plain text, Word2Vec learns relationships between words automatically.
Word2Vec spatially
encodes word meanings and the relationships between words, and was originally applied to word
clustering and synonym detection (Wolf et al., 2014). We applied Word2Vec to author similarity measurement
by treating cited author names as words in plain text.
Since an author's oeuvre is represented by the citing
sentences in research articles, the Word2Vec-based method can consider the various topics of the author.
In the proposed approach, however, the author names are also trained as
words in the same citing sentence. Therefore, the similarity between two authors in the Word2Vec-based
method reflects both topical relatedness and collaborations.
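A small sketch may help show why collaborations enter the similarity. It replaces each citation marker in a citing sentence with one token per cited author, so co-authors of the same reference land next to each other and share a context window during training. The `[refN]` marker format, the `AUTHOR_` token scheme, and the reference data are hypothetical.

```python
import re

# Hypothetical reference information keyed by reference id.
references = {
    1: ["White, H. D.", "Griffith, B. C."],
    2: ["Salton, G."],
}
REF_MARKER = re.compile(r"\[ref(\d+)\]")  # assumed marker format

def author_token(name):
    """Turn 'White, H. D.' into a single token such as 'AUTHOR_white_hd'."""
    surname, _, initials = name.partition(",")
    initials = re.sub(r"[^A-Za-z]", "", initials).lower()
    return f"AUTHOR_{surname.strip().lower()}_{initials}"

def to_training_tokens(sentence):
    """Tokenize a citing sentence, substituting author tokens for citation markers."""
    tokens = []
    # re.split with a capturing group alternates plain text and captured reference ids.
    for i, part in enumerate(REF_MARKER.split(sentence)):
        if i % 2:  # odd positions hold the captured reference id
            tokens.extend(author_token(a) for a in references.get(int(part), []))
        else:
            tokens.extend(re.findall(r"[a-z0-9]+", part.lower()))
    return tokens

sentence = "ACA was introduced [ref1] and often uses cosine similarity [ref2]."
tokens = to_training_tokens(sentence)
print(tokens)
```

Token lists like these could then be fed to a Word2Vec implementation (for example the `Word2Vec` class in gensim), after which the cosine similarity between two author-token vectors would serve as the author similarity. The paper does not state which implementation or hyperparameters were used.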
Table 2 shows the top 10 pairs found by the traditional
ACA method (Pearson correlation-based similarity) and by the Word2Vec-based approach, respectively.
About half of the pairs produced by the Word2Vec approach are co-author relationships.
These results imply that the proposed
approach can detect a wider range of author pairs in terms of topical relatedness and capture
more diverse research fields of information science.
To examine whether there are structural differences between the two measures of author similarity, we constructed
two author networks of the top 100 authors. For network visualization, we used PageRank (Brin & Page,
1998) to determine the node size and adopted the modularity algorithm (Blondel, Guillaume,
Lambiotte, & Lefebvre, 2008) for community detection.
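As an illustration of how PageRank can size nodes in a similarity network, here is a minimal power-iteration sketch on a weighted undirected graph. The author labels, edge weights, damping factor, and iteration count are assumptions, and the paper does not say which tool computed its scores. Community detection is not shown.

```python
def pagerank(weights, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration.

    weights maps each node to {neighbor: weight}; for an undirected graph
    every edge must be listed in both directions.
    """
    nodes = list(weights)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    total = {v: sum(weights[v].values()) for v in nodes}
    for _ in range(iterations):
        # Teleportation share that every node receives.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            if total[v] == 0:
                # Isolated node: spread its rank evenly over all nodes.
                for u in nodes:
                    new_rank[u] += damping * rank[v] / n
                continue
            for u, w in weights[v].items():
                # Pass rank to neighbors in proportion to edge weight.
                new_rank[u] += damping * rank[v] * w / total[v]
        rank = new_rank
    return rank

# Hypothetical author-author similarities (symmetric).
similarity = {
    "author_a": {"author_b": 0.9, "author_c": 0.8, "author_d": 0.7},
    "author_b": {"author_a": 0.9},
    "author_c": {"author_a": 0.8},
    "author_d": {"author_a": 0.7},
}
rank = pagerank(similarity)
for author, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(author, round(score, 3))
```

In this toy graph the hub author gets the largest score and would be drawn as the largest node.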
Figure 3 shows roughly two parts, information retrieval and bibliometrics, the two
major research areas in JASIST. The author group for information retrieval (purple), along with information
seeking behavior (red), is located on the left side, and the author group related to bibliometrics is located
on the right side.
Only two authors are located between the two groups (Borgman and Salton), both of whom are
traditionally cited authors in the information science field. Borgman studied various topics including
information retrieval and scholarly communication and wrote important books that won the best
information science book award from ASIST. Salton's works have also received many citations over a long period in the
field of information science.
The
author group related to information retrieval on the left side of the network is split into information
seeking behavior (blue), located in the upper part of the network, and document retrieval (yellow), located
at the bottom. The group related to bibliometrics is also separated into two parts: (1) a
cluster (green) including author analysis and (2) journal citation analysis and evaluation indicators (red).
Unlike Figure 3, the communities in the network are connected to each other. Brin is connected with both
the document retrieval and citation analysis communities. This may be attributed to the fact that
PageRank, developed by Brin and Page (1998), is used in information retrieval and is also studied in
network analysis to compute node centrality.
In bibliometrics, PageRank is adopted as one of the
centrality measures in citation networks (Ding, Yan, Frazho, & Caverlee, 2009). Ingwersen, who is located between
information retrieval and bibliometrics, studied information retrieval in his earlier works and later extended his
research area to network analysis such as webometrics.
This implies that
authors linked by citation are topically grouped in the Word2Vec-based author network.