Wednesday, November 20, 2013

Sugimoto, C. R., Li, D., Russell, T. G., Finlay, S. C., & Ding, Y. (2011). The shifting sands of disciplinary development: analyzing North American Library and Information Science dissertations using latent Dirichlet allocation. Journal of the American Society for Information Science and Technology, 62(1), 185-204.

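This study uses 3,121 doctoral dissertations completed in North America between 1930 and 2009 to examine how the main research topics of library and information science (LIS) have changed. Earlier studies of LIS topics relied mainly on content analysis and cocitation analysis, applied to the journal articles of a single period in order to find that period's main research topics. Important content analyses include Enger, Quirk, & Stewart (1989), Järvelin & Vakkari (1990, 1993), Kumpulainen (1991), Hider & Pimm (2005), and Fidel (2008). Bibliometric analyses of journals include the two papers by Åström (2007, 2010). Bibliometric analyses of authors include Pettigrew & Nicholls (1994), White & McCain (1998), Bates (1998), Budd (2000), White (2001), Levitt & Thelwall (2009a, 2009b), and Åström (2010).

Most of these studies share the following limitations:
1) Most use journal articles as their source, yet several studies (Bazerman, 1988; Hyland, 2000) have shown that writing and citing patterns differ across genres, so analyzing a single genre is likely to yield a single point of view.
2) Besides relying on journal articles, most studies take a small group of highly cited authors or a sample of articles as representative. Because the amount of data analyzed is small, the results are easily swayed by a few items and may not represent the field as a whole.
3) Most are synchronic studies of one period, and diachronic studies of trends are scarce. The few diachronic studies include Smeaton, Keogh, Gurrin, McDonald, & Sødring (2003), who used SIGIR conference papers to trace 25 years of change in information retrieval; Sugimoto & McCain (2010), who also examined topic development in information retrieval; Harter & Hooten (1992), who divided 1972–1990 into three periods to study bibliometric changes in the articles and authors of the Journal of the American Society for Information Science & Technology; Järvelin & Vakkari (1993), who ran a content analysis of journal articles for each of three periods; and Åström (2007), who ran a cocitation analysis of LIS journal articles divided into three periods.

Beyond these limitations, the studies above approach topics only indirectly, through articles and authors, rather than studying the topics themselves. Braam, Moed, & van Raan (1991a, 1991b) tried using the co-occurrence of words to identify the important research topics of a field, but Leydesdorff (1997) argued that the meaning of an individual term varies considerably with the level of textual aggregation, so this kind of analysis remains open to question.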

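This study uses latent Dirichlet allocation (LDA) (Blei, Ng, & Jordan, 2003) to identify the latent topics of each period and the dissertations that represent each topic. LDA is a generative document model that is now widely used to identify and analyze topics. For example, Rosen-Zvi, Griffiths, Steyvers, & Smyth (2004) used LDA to explore the relationship between authors and topics; Tang, Jin, & Zhang (2008) extended the idea to academic networks; and McCallum, Wang, & Corrada-Emmanuel (2007) and Li et al. (2010) applied LDA to social networks and social tagging communities, respectively. Blei & Lafferty (2007) used LDA to study correlations between topics, while Pruteanu-Malinici, Ren, Paisley, Wang, & Carin (2010) and Rzeszutek, Androutsos, & Kyan (2010) looked at how topics change over time. To check whether LDA is a workable way to identify topics, Griffiths & Steyvers (2004) and Zheng, McLean, & Lu (2006) compared the topics LDA produced with existing classification schemes for the literature and found that the approach does work.

LDA represents each document as a random mixture of latent topics: every document contains several topics, and every topic is a distribution over words in different proportions. Each word in a document is produced by repeatedly drawing a topic at random according to the document's topic proportions and then drawing a word according to that topic's word proportions. The LDA variant used in this study is the author–topic model, an extension by Rosen-Zvi, Griffiths, Steyvers, & Smyth (2004). This model not only lets a document contain several topics; it also covers the case in which a document has several authors and each author writes on several topics. The Bayesian network of the author–topic model (Figure 1 in the paper) can be read as follows: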

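1. For a given author x, Θ is the probability distribution over topics z.
2. For a given topic z, φ is the probability distribution over words w.
3. a_d is the set of authors of a document, and x is one author drawn at random from that set.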

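The two unknown parameters of this model, the probability of a topic given an author (Θ) and the probability of a word given a topic (φ), can be estimated with the Gibbs sampling algorithm. The algorithm uses successive Markov chain sampling: conditioned on all other variables, it repeatedly draws an author–topic pair (x, z), and the counts observed in these draws are used for the estimate.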
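The sampling step is commonly written in the form below, following the author–topic model of Rosen-Zvi et al. (2004). This is a reconstruction from the count definitions given in these notes, not a copy of the paper's own equation. It assumes symmetric Dirichlet hyperparameters α (for Θ) and β (for φ), and the subscript −i means that the current token is left out of all counts.

```latex
P(x_i = x,\; z_i = j \mid w_i = m,\; \mathbf{z}_{-i},\, \mathbf{x}_{-i},\, \mathbf{w}_{-i},\, a_d)
\;\propto\;
\frac{nw[m][j] + \beta}{nkwsum[j] + V\beta}
\cdot
\frac{na[x][j] + \alpha}{naksum[x] + T\alpha}
```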

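Here, nw[m][j] is the number of times the m-th word is assigned to the j-th topic; nkwsum[j] is the total number of word assignments to the j-th topic; V is the vocabulary size, that is, the number of distinct words in the corpus; na[x][j] is the number of times the j-th topic is assigned to author x; naksum[x] is the total number of topic assignments to author x; and T is the number of topics.
From this process, φ and Θ are estimated as follows: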
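A common form of these point estimates is shown below. It is reconstructed from the count definitions in these notes rather than copied from the paper, and it assumes symmetric Dirichlet hyperparameters α for Θ and β for φ. Each estimate is a smoothed relative frequency, so φ sums to one over the V words of a topic and Θ sums to one over the T topics of an author.

```latex
\varphi_{mj} = \frac{nw[m][j] + \beta}{nkwsum[j] + V\beta},
\qquad
\Theta_{xj} = \frac{na[x][j] + \alpha}{naksum[x] + T\alpha}
```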


In practice, the number of topics can be chosen by measuring model performance with perplexity; a lower perplexity indicates a better model.

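This study divides 1930–2009 into five periods and fits 50 topics for each period. The five most probable topics of a period are taken as its main research topics, and the most probable words of each topic are used to interpret it. The findings are summarized in Figure 3 of the paper.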

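From the initial period (1930–1969) to the present (2000–2009), the topics of LIS have changed fundamentally. However, library history, citation analysis, and information retrieval each appear in more than one period and can be regarded as core topics of LIS.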

This work identifies changes in dominant topics in library and information science (LIS) over time, by analyzing the 3,121 doctoral dissertations completed between 1930 and 2009 at North American Library and Information Science programs.

The authors utilize latent Dirichlet allocation (LDA) to identify latent topics diachronically and to identify representative dissertations of those topics.

The findings indicate that the main topics in LIS have changed substantially from those in the initial period (1930–1969) to the present (2000–2009). However, some themes occurred in multiple periods, representing core areas of the field: library history occurred in the first two periods; citation analysis in the second and third periods; and information-seeking behavior in the fourth and last period.

Many evaluations of library and information science (LIS) have been conducted, primarily using the methods of content analysis and cocitation analysis on journal articles (e.g., Järvelin & Vakkari, 1993; White & McCain, 1998).

Although these studies constitute one lens on the field, there are some major limitations to the current literature in the area.
First, the focus on a single communicative genre (the journal article) provides a monocular view of the field. Research has shown that the writing and citing patterns of authors vary significantly by genre (Bazerman, 1988; Hyland, 2000). A different topic spectrum may be found by examining topics across multiple genres.
Second, the focus has been on either a group of highly cited authors or a sample of journal articles. Previous analyses have been manually intensive, necessitating small sample sizes. This has the potential to skew the results in two ways: (a) highly cited works are not necessarily representative of the works produced, and (b) a few articles/authors can heavily influence the results.
Lastly, the analyses have been largely synchronic, rather than diachronic. Therefore, trend data rely on replication studies, which are not prevalent in the literature.

Many quantitative analyses have been conducted to analyze the domain of LIS: content analyses of journal articles (e.g., Enger, Quirk, & Stewart, 1989; Fidel, 2008; Hider & Pimm, 2005; Järvelin & Vakkari, 1990, 1993; Kumpulainen, 1991), bibliometric analyses of journal articles (e.g., Åström, 2007, 2010), and bibliometric analyses of authors (e.g., Åström, 2010; Bates, 1998; Budd, 2000; Pettigrew & Nicholls, 1994; Levitt & Thelwall, 2009a, 2009b; White, 2001; White & McCain, 1998) to provide large-scale descriptions of the field.

Some analyses have focused on particular journals (e.g., Harter & Hooten, 1992; Lipetz, 1999; Liu, 2002; Park, 2010), conference proceedings (e.g., Smeaton, Keogh, Gurrin, McDonald, & Sødring, 2003), subject areas (e.g., Sugimoto & McCain, 2010), or countries (e.g., Cano, 1999; Uzun, 2002).

Scholars have also performed bibliometric analyses to examine the relationship between LIS and other disciplines (e.g., Borgman & Rice, 1992; Ellis, Allen, & Wilson, 1999; Meyer & Spencer, 1996; Odell & Gabbard, 2008; Sugimoto, Pratt, & Hauser, 2008).

The majority of quantitative analyses of LIS share four things: (a) the journal article is the focal communicative genre, (b) they are synchronic, rather than diachronic, (c) they focus on relationships between journals and/or journal authors (rather than topic analysis), and (d) those focusing on topic analysis use methods of co-occurrence or content analysis.

Some notable exceptions (on at least one of the four points) are the works by Smeaton et al. (2003) and Sugimoto and McCain (2010), which looked at changes in topics in information retrieval over time; Harter and Hooten’s (1992) bibliometric study of articles in the Journal of the American Society for Information Science & Technology for three time slices; Åström’s (2007) cocitation analysis of LIS journal articles for three periods; and Järvelin and Vakkari’s (1993) content analysis of journal articles for three periods.

Some work has been done to examine the value of various methods of topic analysis, comparing the results found through cocitation with co-word analysis (e.g., Braam, Moed, & van Raan, 1991a, 1991b) and the difference between using titles, author-supplied, or indexer-supplied keywords (Whittaker, Courtial, & Law, 1989). ... Scholars have also criticized the use of co-occurrence analysis of terms, noting the large variance in meanings of individual terms based on the level of textual aggregation under investigation (Leydesdorff, 1997).

However, the manual intensity of content analysis becomes difficult, as the number of dissertations in the discipline has gone from a few hundred to a few thousand. Although content analysis can provide high granularity for individual works, it becomes difficult to assess the entire body of work without automatic techniques.

Latent Dirichlet allocation was proposed by Blei, Ng, and Jordan (2003) as a generative probabilistic model useful for discovering underlying topics in collections of data.

Expansions of LDA have also been used to understand correlations between topics (Blei & Lafferty, 2007), authors (Rosen-Zvi, Griffiths, Steyvers, & Smyth, 2004), academic networks (Tang, Jin, & Zhang, 2008), social networks (McCallum, Wang, & Corrada-Emmanuel, 2007), social tagging communities (Li et al., 2010), and changes in topic over time (Pruteanu-Malinici, Ren, Paisley, Wang, & Carin, 2010; Rzeszutek, Androutsos, & Kyan, 2010).

Two exceptions to this are Griffiths and Steyvers’ (2004) analysis of abstracts from the Proceedings of the National Academy of Sciences (PNAS) from 1991–2001 and Zheng, McLean, and Lu’s (2006) analysis of the bioinformatics literature (from MEDLINE abstracts). These studies found that LDA performed well, in that the latent structure mimicked some characteristics of explicit structure (such as categorization schemes). Moreover, the studies displayed the ability of LDA to analyze the rich underlying structures of the domain, depicting emerging and sustained trends in a given discourse.

In LDA, a topic is characterized by a distribution over words and then documents are represented as random mixtures over latent topics (p. 996). As a three-level hierarchical Bayesian model, each topic node is sampled repeatedly. The result is that words may be repeated within topics and documents may be associated with more than one topic.

This model was extended to what is called the author–topic model (Rosen-Zvi et al., 2004), which is used in the present analysis. In this model, not only is each document a mixture of probabilistic topics, but each author is also seen as a mixture of probabilistic topics. In the same way that a document can be generated from multiple topics, this extended model recognizes that an author can be “about” multiple topics (within a single document or across documents).

The author–topic model allows us to not only examine which topics were most salient across the various periods, but also which authors are most associated with these topics.



The figure can be explained as follows:
1. Θ is the probability of a topic given an author x; α is a hyperparameter for Θ.
2. φ is the probability of a word w given a topic z; β is a hyperparameter for φ.
3. a_d provides for the fact that multiple authors can write a single document; x is a randomly selected author from a_d. (Note that there are only single authors in this selection; however, it is still necessary to identify author x.)
4. Given author x, we identify the topic z most likely to be associated with the given author.
5. Given topic z, we identify the words w most likely to be associated with the given topic.
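The steps above can be read as a generative recipe: pick an author from a_d, pick a topic from that author's Θ, then pick a word from that topic's φ. The sketch below is only an illustration of that recipe in plain Python; the function name `generate_document` and the list-based layout of `theta` and `phi` are assumptions made for the example, not something taken from the paper.

```python
import random


def generate_document(authors, theta, phi, vocab, n_words, rng):
    """Generate n_words tokens under the author-topic model (illustrative sketch).

    authors -- author ids of this document (the set a_d)
    theta   -- theta[x][j] = P(topic j | author x)
    phi     -- phi[j][m]   = P(word m | topic j)
    vocab   -- vocab[m] is the string for word id m
    Returns a list of (author, topic, word) triples.
    """
    tokens = []
    for _ in range(n_words):
        # Step 3: x is drawn uniformly at random from the document's authors.
        x = rng.choice(authors)
        # Step 4: draw a topic z from the author's topic distribution.
        z = rng.choices(range(len(theta[x])), weights=theta[x])[0]
        # Step 5: draw a word w from the topic's word distribution.
        w = rng.choices(range(len(vocab)), weights=phi[z])[0]
        tokens.append((x, z, vocab[w]))
    return tokens
```

With a single author per dissertation, as in this data set, the first draw always returns the same author, but keeping it makes the role of x explicit.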

Given the Bayesian network shown in Figure 1, the joint probability of the author–topic pair was estimated using the Gibbs sampling algorithm (Casella, 2001). This algorithm allows for an estimation of the unknown parameters of the model, namely, the probability of a topic given an author and the probability of a word given a topic. Gibbs sampling uses successive Markov chain sampling to repeatedly draw x and z (as a pair), conditioned on all other variables. The process can be expressed as follows:



where nw[m][j] is the number of times a single word is assigned to a topic; nkwsum[j] is the number of times any word is assigned to a topic; na[x][j] is the number of times a single author is assigned to a topic; naksum[x] is the number of times any topic is assigned to an author.

Once this process has been iterated 2,000 times, the results of these variables are used to calculate Θ and φ.
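To make the counting concrete, here is a minimal collapsed Gibbs sampler for the author–topic model in plain Python. It is a sketch, not the authors' implementation: the sampling weights use the usual smoothed-count form from Rosen-Zvi et al. (2004), the hyperparameter defaults `alpha=0.5` and `beta=0.01` are placeholder assumptions, and the function name and argument layout are invented for the example. Only the count arrays (nw, nkwsum, na, naksum) and the 2,000 iterations come from the text.

```python
import random


def gibbs_author_topic(docs, doc_authors, V, A, T,
                       alpha=0.5, beta=0.01, iters=2000, seed=0):
    """Collapsed Gibbs sampler sketch for the author-topic model.

    docs[d]        -- list of word ids (0..V-1) in document d
    doc_authors[d] -- list of author ids (0..A-1) of document d, i.e. a_d
    Returns (phi, theta): phi[j][m] = P(word m | topic j),
    theta[x][j] = P(topic j | author x).
    """
    rng = random.Random(seed)
    nw = [[0] * T for _ in range(V)]   # nw[m][j]: word m assigned to topic j
    nkwsum = [0] * T                   # nkwsum[j]: any word assigned to topic j
    na = [[0] * T for _ in range(A)]   # na[x][j]: topic j assigned to author x
    naksum = [0] * A                   # naksum[x]: any topic assigned to author x

    def update(m, x, j, delta):
        nw[m][j] += delta
        nkwsum[j] += delta
        na[x][j] += delta
        naksum[x] += delta

    # Start from a random (author, topic) pair for every token.
    assign = []
    for words, authors in zip(docs, doc_authors):
        row = []
        for m in words:
            x, j = rng.choice(authors), rng.randrange(T)
            update(m, x, j, +1)
            row.append((x, j))
        assign.append(row)

    for _ in range(iters):
        for d, words in enumerate(docs):
            for i, m in enumerate(words):
                x, j = assign[d][i]
                update(m, x, j, -1)    # leave the current token out of the counts
                pairs, weights = [], []
                for xa in doc_authors[d]:
                    for k in range(T):
                        p_word = (nw[m][k] + beta) / (nkwsum[k] + V * beta)
                        p_topic = (na[xa][k] + alpha) / (naksum[xa] + T * alpha)
                        pairs.append((xa, k))
                        weights.append(p_word * p_topic)
                # Draw the (author, topic) pair jointly, as described above.
                x, j = rng.choices(pairs, weights=weights)[0]
                update(m, x, j, +1)
                assign[d][i] = (x, j)

    phi = [[(nw[m][j] + beta) / (nkwsum[j] + V * beta) for m in range(V)]
           for j in range(T)]
    theta = [[(na[x][j] + alpha) / (naksum[x] + T * alpha) for j in range(T)]
             for x in range(A)]
    return phi, theta
```

A pure-Python loop like this is far too slow for thousands of dissertations; it is meant to show what each counter does, and a real run would use an optimized library.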


Perplexity analysis is used to estimate the performance of the model. This analysis is used when we have an unknown probability distribution in the data. A lower perplexity value indicates better performance. As shown in Figure 2, the performance for our data stabilized at 50 topics.
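As a rough illustration of the measure, perplexity is the exponential of the negative average log-likelihood per word. The sketch below assumes the author–topic likelihood, in which a word's probability is averaged over the document's authors and summed over topics; the function name and data layout are assumptions, and the text does not say whether the authors evaluated on held-out data.

```python
import math


def perplexity(docs, doc_authors, phi, theta):
    """Perplexity of a fitted author-topic model on a set of documents.

    docs[d]        -- list of word ids in document d
    doc_authors[d] -- list of author ids of document d
    phi[j][m]      -- P(word m | topic j)
    theta[x][j]    -- P(topic j | author x)
    """
    log_lik = 0.0
    n_tokens = 0
    n_topics = len(phi)
    for words, authors in zip(docs, doc_authors):
        for m in words:
            # P(w) = average over authors of sum over topics theta * phi.
            p = sum(theta[x][j] * phi[j][m]
                    for x in authors for j in range(n_topics)) / len(authors)
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)
```

To choose the number of topics, one would fit the model for several values of T, compute this quantity for each, and look for the point where the curve levels off.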

At this point, all data were analyzed according to the author–topic model, divided into the five time slices. For each period, 50 topics were identified. Each topic contained a probability value—that is, the likelihood that the topic identified should be associated with the period. These topics were ranked by probability values and the top five were selected as being most representative of the period.

Similarly, a probability for each word was calculated to represent the association between a word and the given topic. These were ranked by probability values and the top 20 were chosen as most representative of the topic.

Lastly, the authors were assigned probability values for each topic and these too were ranked. The top five were chosen as highly representative authors for the given topic.
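The three ranking steps (top five topics per period, top 20 words per topic, top five authors per topic) amount to sorting by probability and truncating. The sketch below shows one way to do this; the helper names are hypothetical, and `topic_prob` is simply taken as an input because the text does not say how a topic's probability for a period was computed.

```python
def top_k(scores, k):
    """Return the k highest-scoring (index, score) pairs, best first."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def summarize_period(phi, theta, topic_prob, vocab, author_names,
                     n_topics=5, n_words=20, n_authors=5):
    """Pick the top topics of a period, then the top words and authors of each.

    phi[j][m]     -- P(word m | topic j)
    theta[x][j]   -- P(topic j | author x)
    topic_prob[j] -- probability of topic j for the period (assumed given)
    """
    summary = []
    for j, prob in top_k(topic_prob, n_topics):
        words = [vocab[m] for m, _ in top_k(phi[j], n_words)]
        # Rank authors by how much probability they give to topic j.
        author_scores = [row[j] for row in theta]
        authors = [author_names[x] for x, _ in top_k(author_scores, n_authors)]
        summary.append({"topic": j, "prob": prob,
                        "words": words, "authors": authors})
    return summary
```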

The results of the topic analyses are summarized in Figure 3, with the topics ranked from highest (top) to lowest for each period.



Some topics occur across multiple periods, including library history, librarianship, information use, citation analysis, classification, information retrieval (abbreviated as IR in Figure 3), and information-seeking behavior. These are the core topics in LIS dissertations from 1930 to 2009.

A limitation inherent with LDA analysis is in the manual interpretation and labeling of “topics.” Although some topics were fairly straightforward to label (e.g., Topic 5c, the top three loading words of which were (a) information, (b) seeking, and (c) behavior), others proved more difficult to ascertain the content or methodological relationship that connected the words and dissertations.

The top three topics by period identified in Järvelin and Vakkari’s (1993) content analysis of LIS journal articles are displayed in Table 7. These topics share some similarity with the present findings: classification is certainly a top theme in the first period, but there is an overwhelming emphasis on history in this period that is not represented by Järvelin and Vakkari’s results. ... The discrepancies between these findings lead to questions relative to genre and the possible impact of genre upon the topics. Further analysis should explore whether certain topics appear first in particular communicative genres and whether genres consistently emphasize different topics within a single field.



In terms of the highest loading specialties, this work confirmed an interest in information retrieval, bibliometrics, and information use consistent with the findings of White and McCain. However, the current analysis places a stronger emphasis on library evaluation, management, and education than was the case with White and McCain’s study. Although White and McCain did not undertake a detailed analysis on the changes in topics over time, they provided evidence for a shift toward the cognitive and user side of information retrieval.

Åström (2007) similarly evaluated LIS using cocitation analysis. He identified the main clusters by period as shown in Table 8.



For example, numerous studies summarized above noted the lack of theoretical work published in the LIS literature. It may be the case that journal articles are more likely to cover experimental work, whereas dissertations tend toward the theoretical.

However, the findings suggest that the bulk of dissertations completed at these schools do not have explicit connection to library practice.

Houser (1982) states that a “discipline is formed to solve a range of problems about some natural or social phenomena. . . [t]hese problems have a genealogy, that is, a continuity which forms the domain of the discipline” (p. 97).

Turner’s (2000) definition of a discipline focused on the identity of a shared name for a specialization and the exchange of scholars within that discipline (thereby propagating the discipline).

In examining dissertations, we were able to identify some dominant themes in LIS: These can be broadly defined as information-seeking, use, access, organization, and retrieval; and the education and training of the professionals providing these services.
