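New research topics and specialties emerge as new species, diseases, and societal patterns are discovered (Li et al., 2010; Yan, Ding, Milojević, & Sugimoto, 2012). Over time, the associated research communities grow or change in size; some topics persist while others disappear (Griffiths & Steyvers, 2004; Upham & Small, 2010; Shi, Nallapati, Leskovec, McFarland, & Jurafsky, 2010). Many studies have used bibliographic data to identify research specialties, for example Kessler's (1963) paper bibliographic coupling networks, Small's (1973) paper co-citation networks, White and McCain's (1998) author co-citation networks, and White's (2003) pathfinder networks; Callon, Courtial, and Laville (1991), Ding, Chowdhury, and Foo (2000), and Milojević, Sugimoto, Yan, and Ding (2011) used co-word networks. These studies identify research topics at different levels: the paper level (Chen, 2004, 2006; Kessler, 1963; Small, 1973), the author level (Clauset, Newman, & Moore, 2004; White & McCain, 1998; White, 2003), the journal level (Glänzel & Schubert, 2003; Leydesdorff & Vaughan, 2006), and the field level (Janssens, Zhang, Moor, & Glänzel, 2009; Rafols & Leydesdorff, 2009; Zhang, Liu, Janssens, Liang, & Glänzel, 2010). At the lower levels of research entities, such as papers and authors, studies can discover topics or specialties within a field; at higher levels, such as journals or fields, they typically identify subfields from more comprehensive data. Methods for identifying topics include classic clustering techniques such as factor analysis and multidimensional scaling, as well as newer techniques such as edge betweenness, modularity, and hybrid clustering. The study summarized here (Yan, 2014) instead uses a topic model to identify research topics and proposes two dynamic characteristics, topic continuity and topic popularity, for analyzing them. Approaches that apply topic modeling to topic dynamics include post hoc analysis (e.g., Griffiths & Steyvers, 2004; Hall, Jurafsky, & Manning, 2008), segmented approaches (e.g., Bolelli, Ertekin, Zhou, & Giles, 2009), and the continuous-time model (Wang & McCallum, 2006). This study adopts post hoc analysis, using the probability distribution of topics over documents to assess the probability that a topic is present.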
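The study builds one topic model for each year and, for each topic, finds the most likely corresponding topic in the following year. Similarity between two topics is measured with the Jensen–Shannon divergence (JSD), a refinement of the Kullback–Leibler divergence (KLD). For two models P and Q, JSD(P||Q) = (1/2) KLD(P||M) + (1/2) KLD(Q||M), where KLD is the Kullback–Leibler divergence between two models and M = (1/2)(P + Q). For each topic, the most likely topic in the following year is the one with the smallest JSD; that score is called the JJSDS. The trend in JJSDS over time is used to compute topic continuity, which is then standardized as a z score.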
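Topic popularity is assessed by computing each topic's mean probability over the documents of a given year and standardizing it as a z score; a larger value indicates that the topic is more popular in that year. The trend in each topic's popularity is then analyzed.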
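The data are library and information science publications from 2001 to 2011, including journal articles, review articles, and proceedings papers. Paper titles serve as the text for analysis, for a total of 27,796 papers. The number of topics is set to 20 for every year. The results show that the topics "web information retrieval", "citation and bibliometrics", "system and technology", and "health science" have the highest average popularity, while "h-index", "online communities", "data preservation", "social media", and "web analysis" are becoming increasingly popular in library and information science. The topics found are consistent with previous studies; the contribution of this paper lies in its analysis of the dynamics of research topics.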
Dynamic development is an intrinsic characteristic of research topics. To study this, this
paper proposes two sets of topic attributes to examine topic dynamic characteristics:
topic continuity and topic popularity.
Topic continuity comprises six attributes: steady,
concentrating, diluting, sporadic, transforming, and emerging topics; topic popularity
comprises three attributes: rising, declining, and fluctuating topics.
These attributes are
applied to a data set on library and information science publications during the past 11
years (2001–2011).
Results show that topics on “web information retrieval”, “citation and
bibliometrics”, “system and technology”, and “health science” have the highest average
popularity; topics on “h-index”, “online communities”, “data preservation”, “social media”,
and “web analysis” are increasingly becoming popular in library and information science.
Dynamics is a constant theme in scientific explorations. Research communities may grow or change in size; new species,
diseases, or societal patterns may be discovered; and new research topics and specialties may be introduced (Li et al.,
2010; Yan, Ding, Milojevic, & Sugimoto, 2012). Over time, some topics are continuously investigated while others appear or
disappear (Griffiths & Steyvers, 2004; Upham & Small, 2010; Shi, Nallapati, Leskovec, McFarland, & Jurafsky, 2010). Therefore,
it is of great importance to examine research dynamics to understand the evolving cognitive structures of research domains.
Pioneering studies of paper bibliographic coupling networks (Kessler, 1963), paper co-citation networks (Small, 1973),
author co-citation networks (White & McCain, 1998), pathfinder networks (White, 2003) and co-word networks (e.g., Callon,
Courtial, & Laville, 1991; Ding, Chowdhury, & Foo, 2000; Milojević, Sugimoto, Yan, & Ding, 2011) were capable of identifying
research specialties from bibliographic data effectively.
However, findings from these studies remained largely static and
thus only yielded fixed perspectives on the cognitive structure of research domains.
To examine research dynamics, this study uses a topic modeling technique and proposes two sets of topic attributes: topic
continuity and topic popularity. The study addresses the following research questions:
• How can topic modeling techniques be used to study research dynamics?
• What quantitative measurements can be used to describe topic dynamics?
• What topics are present in library and information science? What are their dynamic characteristics?
This subsection reviews the network-based approaches to identifying research topics and specialties. These approaches
have been applied to several research levels, including the paper-level (e.g., Chen, 2004, 2006; Kessler, 1963; Small, 1973),
the author-level (e.g., Clauset, Newman, & Moore, 2004; White & McCain, 1998; White, 2003), the journal-level (e.g., Glänzel
& Schubert, 2003; Leydesdorff & Vaughan, 2006), and the field-level (e.g., Janssens, Zhang, Moor, & Glänzel, 2009; Rafols &
Leydesdorff, 2009; Zhang, Liu, Janssens, Liang, & Glänzel, 2010).
Most of the above-mentioned work used co-occurrence networks as the research instrument.
Analyses on lower level research
entities, such as papers and authors, usually identified topics and specialties from small but well-defined research fields;
whereas analyses on higher level research entities, such as journals and fields, attempted to identify subfields and
subdomains from more comprehensive data sets.
Both classic clustering techniques (e.g., factor analysis and multidimensional
scaling) as well as modern techniques (e.g., edge betweenness, modularity, and hybrid clustering) have been
applied.
Recently, studies have attempted to add dynamic analyses
by utilizing multiple time intervals.
Several approaches to slicing time intervals are available: intervals that have the same
number of references (e.g., Radicchi et al., 2009), intervals that have the same number of publications (e.g., Sugimoto, Li,
Russell, Finlay, & Ding, 2011; Yan & Sugimoto, 2011), same-length intervals (e.g., Åström, 2007; Milojević et al., 2011), and
accumulative intervals (e.g., Barabási et al., 2002; Yan & Ding, 2009).
These studies laid a valuable methodological basis for
dynamic analyses of the cognitive structures of research fields; however, networks from different time frames were largely analyzed
separately, and a more integrated examination was lacking.
In addition, in practice, interpreting network-based clustering results
effectively may require domain expertise.
Topic modeling techniques use probabilistic models to assign papers, journals, or authors to clusters. A topic can be defined as a probability distribution over terms in a vocabulary (Blei & Lafferty, 2007). The Latent Dirichlet Allocation (LDA) model, a classic topic model, was proposed by Blei et al. (2003). The model posits that the words of each paper are drawn from a mixture of topics and that each topic follows a multinomial distribution.
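The generative story behind LDA can be illustrated with a minimal sketch. It is not the model fitted in the study (which uses the ACT model); the vocabulary, the two topic-word distributions, the Dirichlet parameter `alpha`, and the function names are all hypothetical. For brevity the topic-word distributions are fixed by hand, whereas full LDA also places a Dirichlet prior on them.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document: draw a topic mixture, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)          # document-specific topic mixture
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]      # pick a topic for this position
        w = rng.choices(vocab, weights=topic_word[z])[0]  # pick a word from that topic
        words.append(w)
    return theta, words


# Hypothetical toy setup: 4-term vocabulary, 2 topics (each row sums to 1).
vocab = ["citation", "web", "retrieval", "library"]
topic_word = [
    [0.60, 0.10, 0.10, 0.20],  # a "citation"-flavored topic
    [0.05, 0.45, 0.45, 0.05],  # a "web retrieval"-flavored topic
]
rng = random.Random(0)
theta, words = generate_document(topic_word, vocab, alpha=[0.5, 0.5], n_words=8, rng=rng)
```

Fitting a topic model runs this story in reverse: given only the observed words, it estimates the topic mixtures and the topic-word distributions.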
One recent update of the LDA
model is the supervised LDA model, which makes the analysis of multi-labeled corpora (e.g., tags from delicious.com and various
classifications) possible. Blei and McAuliffe’s (2010) version of supervised LDA can successfully address this challenge, but
a document can only be assigned one label.
Ramage, Hall, Nallapati, and Manning (2009) offered an approach that enabled multi-label assignment. Their supervised labeled LDA (L-LDA) associated one label with one topic and allowed the model to learn word-label relations.
Through topic modeling techniques, topic dynamics has been examined mainly through the following approaches: post hoc analysis (e.g., Griffiths & Steyvers, 2004; Hall, Jurafsky, & Manning, 2008), segmented approaches (e.g., Bolelli, Ertekin, Zhou, & Giles, 2009), and the continuous-time model (Wang & McCallum, 2006).
Post hoc analysis uses topic-document probability distributions to evaluate the presence of identified topics.
Segmented approaches build the dynamic component into the probabilistic model. They assume that the state of topics at a single time point is independent of all other time points and divide document corpora into segments that have contingent time stamps (Bolelli et al., 2009).
The continuous-time model is a non-Markov model proposed by Wang and McCallum (2006), who found that it provides better prediction and more interpretable topical trends.
In this study, a post hoc dynamic analysis using the ACT model is selected because of its marked performance (Tang et al., 2008) as well as its advanced input and output support.
Topic dynamics is calculated through the Author-Conference-Topic (ACT) model (Tang et al., 2008).
Specifically, θ_i is the topic distribution for document i. The mean (θ̄), therefore, is a direct quantitative measurement of topic popularity: the higher the θ̄, the more visible the topic, and thus the more popular that topic is (Griffiths & Steyvers, 2004).
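The popularity measure can be sketched as follows, assuming the per-document topic distributions for one year are available as rows of a matrix. The toy data, the function names, the use of the population standard deviation, and the choice to standardize across the topics of a single year are assumptions made for illustration; the excerpt does not specify these details.

```python
from statistics import mean, pstdev


def topic_popularity(doc_topic):
    """Mean probability of each topic over all documents of one year."""
    n_topics = len(doc_topic[0])
    return [mean(row[k] for row in doc_topic) for k in range(n_topics)]


def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # all values equal: nothing to standardize
    return [(v - mu) / sigma for v in values]


# Hypothetical toy data: 4 documents, 3 topics; each row is one document's
# topic distribution and sums to 1.
theta = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.7, 0.1],
]
popularity = topic_popularity(theta)    # roughly [0.5, 0.4, 0.1]
standardized = z_scores(popularity)
```

In this toy example the first topic has the highest mean probability and therefore the highest standardized popularity.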
Because the data set spans 11 years, 11 independent ACT models were run, one for each year of the data set based on year of publication.
The Jensen–Shannon divergence (JSD) was used as the similarity measurement to quantify the topic similarity between different word-topic distributions. ... JSD is a symmetrized and smoothed version of the Kullback–Leibler divergence (KLD). ... As a divergence measure, the smaller the JSD, the higher the similarity is.
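As a minimal sketch of the measure, the function below computes JSD(P||Q) = (1/2) KLD(P||M) + (1/2) KLD(Q||M) with M = (1/2)(P + Q) for two discrete distributions over the same vocabulary. The base-2 logarithm (which bounds the result between 0 and 1) and the function names are assumptions; the excerpt does not state which base was used.

```python
import math


def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture M = (P + Q) / 2
    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


# Hypothetical word distributions of two topics over a 3-term vocabulary.
p = [0.8, 0.1, 0.1]
q = [0.1, 0.1, 0.8]
same = jsd(p, p)            # identical distributions
sym_a, sym_b = jsd(p, q), jsd(q, p)
disjoint = jsd([1.0, 0.0], [0.0, 1.0])
```

Because M is positive wherever P or Q is positive, the divergence stays finite even when the two topics share no words, which is the smoothing property mentioned above.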
In order to track the same topic from two adjacent time intervals, the minimum value for each row of a JSD matrix was used, referred to as the joint JSD score (JJSDS): MIN(JSD Matrix(i,j)), for j = 1:n. ... Applying the same approach to each pair of adjacent time slices, for each topic, an array of JJSDS can be obtained.
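The row-minimum step can be sketched for a single pair of adjacent years as follows. Rows of the JSD matrix are the topics of year t and columns are the topics of year t+1; the toy topic-word distributions and the function names are hypothetical, and base-2 logarithms are assumed.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base-2 logs) between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kld(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kld(p, m) + 0.5 * kld(q, m)


def jsd_matrix(topics_t, topics_next):
    """Entry (i, j) is the JSD between topic i of year t and topic j of year t+1."""
    return [[jsd(p, q) for q in topics_next] for p in topics_t]


def jjsds(matrix):
    """For each row i, return (minimum JSD, column index j of the best match)."""
    result = []
    for row in matrix:
        j = min(range(len(row)), key=row.__getitem__)
        result.append((row[j], j))
    return result


# Hypothetical word distributions: 2 topics per year over a 3-term vocabulary.
year_t = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
year_next = [[0.1, 0.2, 0.7], [0.7, 0.2, 0.1]]
scores = jjsds(jsd_matrix(year_t, year_next))
matches = [j for _, j in scores]
```

Repeating this for every pair of adjacent years and following each topic's best match forward gives the per-topic array of JJSDS values described above.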
The attributes of steady, concentrating, and diluting topics focus on the overall topical characteristics, whereas the attributes of sporadic, transforming, and emerging topics focus on the topical characteristics of a specified time frame. Therefore, these attributes are not mutually exclusive: a topic can be a concentrating topic overall while, at the same time, related topics are added to it, qualifying it as a transforming topic.
The data set contains publications from all journals indexed in the 2011 edition of the Journal Citation Reports in the Information Science & Library Science subject category. Articles, proceedings papers, and review articles published in these journals from 2001 to 2011 were downloaded for analysis (download date: October 2012). Stop words were then removed from publications’ titles. Publications without titles, authors, or journal names were removed from the data set. The final data set comprised 27,796 papers.
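The cleaning steps described above can be sketched as follows. The record layout (dictionaries with `title`, `authors`, and `journal` keys), the short stop-word list, and the lowercase whitespace tokenization are assumptions for illustration; the excerpt does not say which stop-word list or tokenizer was used.

```python
# Hypothetical stop-word list; the list actually used is not given in the excerpt.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}

REQUIRED_FIELDS = ("title", "authors", "journal")


def clean_records(records):
    """Drop records with a missing required field and strip stop words from titles."""
    kept = []
    for rec in records:
        if not all(rec.get(field) for field in REQUIRED_FIELDS):
            continue  # missing or empty title, authors, or journal name
        tokens = [w for w in rec["title"].lower().split() if w not in STOP_WORDS]
        kept.append({**rec, "tokens": tokens})
    return kept


# Hypothetical toy records.
records = [
    {"title": "The dynamics of research topics", "authors": ["A. Author"], "journal": "J1"},
    {"title": "A study without authors", "authors": [], "journal": "J2"},
]
cleaned = clean_records(records)
```

Only the first toy record survives, with its title reduced to the tokens that would be passed to the topic model.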
The number of topics is set at 20: this number considers the size of the paper corpus as well as previous empirical studies on the cognitive structure of library and information science (e.g., Milojević et al., 2011; Sugimoto et al., 2011; White & McCain, 1998; Zhao & Strotmann, 2008). For reasons of consistency, the same number of topics was identified for each year of the data set.
In this subsection, we first present histograms made from values in Jensen–Shannon divergence (JSD) matrices (Fig. 4). These histograms give a direct view of how research topics in library and information science are related as measured by JSD. This subsection then introduces all 20 topics in each year from 2001 to 2011 as well as how topic continuity and popularity attributes are applied to these topics (Fig. 5).
Fig. 4 uses histograms to visualize JSD values for each pair of adjacent years. Because there are 20 topics for each year, the number of data points in each histogram is 400 (20 × 20). The histogram in the lower right section of Fig. 4 contains 4,000 data points, as it pools the JSD values for all 10 pairs of adjacent years (10 × 400).
This study finds that in library and information science, research topics on “web information retrieval”, “citation and bibliometrics”, “system and technology”, and “health science” have the highest average popularity over the past decade (from 2001 to 2011).
Research on “h-index”, “online communities”, “data preservation”, “social media”, and “web analysis” is becoming increasingly popular.
Overall, findings of this study are consistent with previous studies using co-word, co-citation, and topic modeling techniques.
For instance, a co-word study by Milojević and colleagues (2011) found that the title terms “citation”, “impact factor”, and “web” saw rising usage from 1989 to 2008.
Other related dynamic studies that cover the target time frame of the current study (2001–2011) include Åström’s (2007) study of the library and information science research front, which found that webometrics and information seeking and retrieval became dominant research areas between 2000 and 2004.
This finding has been verified by Klavans and Boyack (2011), who used the global map (i.e., the map of science) to enhance the accuracy of local maps (i.e., the contextual map of information science). They identified five core areas in information science: information-seeking behavior, computer-enhanced retrieval, scientometrics, co-citation analysis, and citation behavior.
Besides the contextual analysis of information science, structural analysis has also been achieved from a time-series empowered author co-citation and document co-citation analysis (Chen, Ibekwe-SanJuan, & Hou, 2010). Through the application of a series of structural metrics such as centrality measures, modularity and silhouette, a clear cognitive structure of information science was attained in that the research areas of interactive information retrieval, academic web, information retrieval, citation behavior, and h-index have gained a particular popularity from 1996 to 2008.
In addition to journal publications, Sugimoto and colleagues (2011) applied an LDA model to library and information science dissertations and demonstrated that dissertations are an important communicative genre. Their study indicated that between 2000 and 2009, internet and information retrieval related topics were the central dissertation research themes.
The contribution of the current study is that it proposes two sets of quantitative topic attributes. These attributes have streamlined the dynamic analysis of research topics and specialties and have further complemented co-occurrence-based studies.
This paper has identified dynamic characteristics of topics in library and information science; however, it can say little about the mechanisms that produced these characteristics. The study is unable to pinpoint, for instance, whether the growing popularity of network and citation studies is the result of a growing research community, a drive by the commercial market, a stimulus from funding agencies, or a combination of these or other unlisted factors.
Popular topics may be associated with research communities that are expanding in size and/or tend to have higher productivity. Conversely, less popular topics may be associated with communities that are shrinking and/or have a reduced productivity. Topic continuity and popularity attributes reflect research specialties’ development in scientific communities, which is further guided by science policies and the attention of the general public.
In informetrics, studies have mainly focused on analyzing the performance and the social and cognitive implications of several types of research entities, including papers, authors, institutions, journals, and fields. Authors and institutions are typically used to examine social relations in academia; while journals and fields are predominantly used to investigate the cognitive structure of research domains.
Topic analysis can precisely provide a more refined assessment by clustering research papers based on certain probability distributions. Because of such quantitative results, a more integrated dynamic cognitive analysis is thus possible, as exemplified through the current study.
Topic analysis will be further developed by overlaying topics with author communities to explore the interwoven relationships between research topics and research communities (e.g., Yan et al., 2012); by overlaying topics with funding data to investigate the “lead-lag” relationship between funding support and productivity (e.g., Shi et al., 2010); by applying topic models to different genres to study research immediacy (e.g., Ding et al., 2013); and by overlaying topics with citation data to examine the relationships between topics and impact.