Saturday, January 25, 2014

Janssens, F., Leta, J., Glänzel, W., & De Moor, B. (2006). Towards mapping library and information science. Information Processing & Management, 42(6), 1614-1642.

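This study uses co-word analysis to distinguish six research topics in library and information science (LIS): two bibliometrics topics, one information retrieval topic, one general-issues topic, one webometrics topic, and one patent studies topic.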

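Co-word analysis describes the content of documents by the words that occur together in them, and uses the relative strength of those co-occurrences to represent a field's concept networks. The technique has already been used to study concept networks in several fields, including plant biology (de Looze and Lemarie, 1997), condensed matter physics (Bhattacharya and Basu, 1998), chemical engineering (Peters and van Raan, 1993), information retrieval (Ding, Chowdhury, and Foo, 2001), and medicine (Onyancha and Ocholla, 2005). Van Raan and Tijssen (1993) discussed the epistemological potential of bibliometric mapping based on co-word analysis. Compared with co-citation analysis, co-word analysis can be applied to data that have no citation index, and co-citation analysis is further complicated by the dynamics and trends of a field and by the behaviour of the citing actors (Noyons & van Raan, 1998). Leydesdorff (1997) argued that the meaning of words changes with the frequency of their relations to other words and with the position in which they appear; Courtial (1998), however, held that words in co-word analysis are not linguistic units standing for a meaning but merely indicators of links between texts.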

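The study lists several earlier works that applied text-based bibliometric methods to the analysis of research topics in library and information science. Courtial (1994) examined the dynamics of scientometrics through co-word analysis of titles and abstracts; the authors take his conclusion as an illustration of how heterogeneous the broader field of LIS is, comprising traditional library science, information retrieval, scientometrics, informetrics, patent analysis, and the recently emerging webometrics. Glänzel and colleagues combined full-text based structural analysis with traditional bibliometric methods to study bibliometrics and its subfields (Glenisson, Glänzel, and Persson, 2005; Glenisson, Glänzel, Janssens, and De Moor, 2005; Janssens, Glenisson, Glänzel, and De Moor, 2005).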

The analysis techniques used in this study include text extraction, preprocessing, multidimensional scaling, and Ward's hierarchical clustering. Similarities between documents are estimated with the vector space model (Salton & McGill, 1986) and latent semantic analysis (Deerwester et al., 1990). The papers were then mapped onto a two-dimensional plot according to their pairwise similarity, with each paper marked by its journal.

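Papers from Scientometrics are located mainly around the two ellipses labelled 1 and 2: ellipse 1 covers bibliometrics and ellipse 2 covers patent analysis. The papers in ellipse 5 come mainly from Information Processing and Management and Journal of the American Society for Information Science and Technology, and their topic is information retrieval. The papers in ellipse 12 lean toward social topics; besides Journal of the American Society for Information Science and Technology, they also come from Journal of Information Science and Journal of Documentation. The ellipse labelled 14, in the very centre, covers web-related topics, and every journal has papers on this topic.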

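Ward's cluster analysis was used to group all the papers, and the best solution consists of six clusters. Each cluster was named according to the important terms of its papers and its central paper (the medoid). The clusters were then marked on the two-dimensional plot.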

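A diagonal line in the plot divides the six clusters into two groups: Bibliometrics1, Bibliometrics2, and Patent Analysis lie below the line, while Webometrics, Information Retrieval, and Social Aspects lie above it. Of the six clusters, Patent Analysis is the most clearly separated from the others. The bibliometrics papers fall into two clusters, Bibliometrics1 and Bibliometrics2. Bibliometrics1 is concerned with topics such as collaboration in science, citation analyses, and national research performance, whereas Bibliometrics2 consists mainly of papers on methodology and bibliometric theory.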

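To find the topics each journal emphasizes, the relationship between clusters and journals was also mapped onto a plot, in addition to comparing the two plots described above. The journal Information Processing and Management and the Information Retrieval cluster almost coincide, which indicates that the papers in Information Processing and Management are closely related to information retrieval. Social Aspects and Webometrics lie close to three journals: Journal of the American Society for Information Science and Technology, Journal of Information Science, and Journal of Documentation. In fact, apart from Scientometrics, Social Aspects is roughly equidistant from all the journals. Finally, Scientometrics falls at the centre of the triangle formed by Bibliometrics1, Bibliometrics2, and Patent Analysis.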

The optimum solution for clustering LIS is found for six clusters. The combination of different mapping techniques, applied to the full text of scientific publications, results in a characteristic tripod pattern. Besides two clusters in bibliometrics, one cluster in information retrieval and one containing general issues, webometrics and patent studies are identified as small but emerging clusters within LIS.

The method was developed by Callon, Courtial, Turner, and Brain (1983), more than two decades ago, for purposes of evaluating research. The methodological foundation of co-word analysis is the idea that the co-occurrence of words describes the contents of documents. By measuring the relative intensity of these co-occurrences, simplified representations of a field’s concept networks can be illustrated (Callon, Courtial, & Laville, 1991).
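The idea of measuring the relative intensity of co-occurrences can be sketched in a few lines of Python. This is only an illustration: the keyword lists are made up, the function name `coword_strengths` is hypothetical, and the normalisation used here (the equivalence index, c_ij² / (c_i · c_j)) is one common choice in the co-word literature, assumed here rather than taken from the excerpt above.

```python
from collections import Counter
from itertools import combinations


def coword_strengths(docs):
    """Map each word pair to its equivalence index c_ij^2 / (c_i * c_j)."""
    occurrences = Counter()     # c_i: number of documents containing word i
    cooccurrences = Counter()   # c_ij: number of documents containing both words
    for words in docs:
        unique = sorted(set(words))          # count each word once per document
        occurrences.update(unique)
        cooccurrences.update(combinations(unique, 2))
    return {
        (a, b): c * c / (occurrences[a] * occurrences[b])
        for (a, b), c in cooccurrences.items()
    }


# Hypothetical keyword lists, one per document.
docs = [
    ["citation", "impact factor", "journal"],
    ["citation", "impact factor"],
    ["patent", "citation"],
    ["patent", "inventor"],
]
strengths = coword_strengths(docs)
```

Pairs with a high index, such as ("citation", "impact factor") in this toy data, would be drawn as strong links in a concept network; weak pairs would be drawn faintly or dropped.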

Van Raan and Tijssen (1993) have discussed the “epistemological” potentials of bibliometric mapping based on co-word analysis.

Leydesdorff (1997) analysed 18 full-text articles and sectional differences therein, and considered that the subsumption of similar words under keywords assumes stability in the meanings, but that words can change both in terms of frequencies of relations with other words, and in terms of positional meaning from one text to another. This fluidity was expected to destabilize representations of developments of the sciences on the basis of co-occurrences and co-absences of words.

However, Courtial (1998) replied that words, in co-word analysis, are not used as linguistic items to mean something, but as indicators of links between texts.

Many researchers have used this methodology to investigate concept networks in different fields, among others, de Looze and Lemarie (1997) in plant biology, Bhattacharya and Basu (1998) in condensed matter physics, Peters and van Raan (1993) in chemical engineering, Ding, Chowdhury, and Foo (2001) in information retrieval (IR) and Onyancha and Ocholla (2005) in medicine.

The reason why the emphasis has shifted from co-citation analysis to co-word techniques is twofold. The first reason is a practical one; co-word analysis allows application to non-citation indexes as well. The second relates to methodology; co-citation analysis complicates the combined analysis of field dynamics and trends in the actors’ activity (Noyons & van Raan, 1998).

Bonnevie (2003) has used primary bibliometric indicators to analyse the Journal of Information Science, while He and Spink (2002) compared the distribution of foreign authors in Journal of Documentation and Journal of the American Society for Information Science and Technology.

Bibliometric trends of the journal Scientometrics, another important journal of the field, have been examined by Schubert and Maczelka (1993), Wouters and Leydesdorff (1994), Schoepflin and Glänzel (2001), Schubert (2002), Dutt, Garg, and Bali (2003).

The main journals of the field were also analysed in terms of journal co-citation and keyword analyses (Marshakova, 2003; Marshakova-Shaikevich, 2005).

The co-citation network of highly cited authors active in the field of IR was studied by Ding, Chowdhury, and Foo (1999).

Finally, Persson (2000, 2001) analysed author co-citation networks on basis of documents published in the journal Scientometrics.

Courtial (1994) has studied the dynamics of the field by analysing the co-occurrence of words in titles and abstracts. Courtial described scientometrics as a hybrid field consisting of invisible colleges, conditioned by demands on the part of scientific research and end-users. Although this situation might have somewhat changed during the last decade, this conclusion illustrates how heterogeneous the much broader field of LIS – comprising subdisciplines such as traditional library science, IR, scientometrics, informetrics, patent analyses and most recently the emerging specialty of webometrics – nowadays is.

In recent papers, Glenisson, Glänzel, and Persson (2005), Glenisson, Glänzel, Janssens, and De Moor (2005), Janssens, Glenisson, Glänzel, and De Moor (2005) have applied full-text based structural analysis in combination with “traditional” bibliometric methods to bibliometrics and its subdisciplines.

The full-text analysis consisted of text extraction, preprocessing, multidimensional scaling, and Ward’s hierarchical clustering (Jain & Dubes, 1988).

In short, the textual information is encoded in the vector space model using the TF-IDF weighting scheme, and similarities are calculated as the cosine of the angle between the vector representations of two items (see Salton & McGill, 1986; Baeza-Yates & Ribeiro-Neto, 1999).
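A minimal sketch of this step, assuming raw term frequency multiplied by log(N / df) as the TF-IDF variant (the excerpt does not say which variant the authors used). The tokens are made-up stems, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenised document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical stemmed documents.
docs = [
    ["citat", "impact", "factor", "citat"],
    ["citat", "rate", "impact"],
    ["queri", "search", "engin"],
]
vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])   # share "citat" and "impact"
sim_02 = cosine(vectors[0], vectors[2])   # no shared terms
```

With this weighting, a term that occurs in every document gets weight zero, which is why very common words contribute nothing to the similarity.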

The term-by-document matrix A is again transformed into a latent semantic index Ak (LSI), an approximation of A, but with rank k much lower than the term or document dimension of A. A latent semantic analysis is advisable, especially when dealing with full-text documents in which a lot of noise is observed.

One advantage of LSI is the fact that synonyms or different term combinations describing the same concept are mapped on the same factor, based on the common context in which they generally appear (Berry et al., 1995; Deerwester et al., 1990).
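The rank-k approximation and the synonym effect can be illustrated with a truncated singular value decomposition. The sketch below assumes NumPy is available; the tiny term-by-document matrix and the function name `lsi` are invented for illustration and are not from the paper.

```python
import numpy as np


def lsi(A, k):
    """Return (k-dimensional document vectors, rank-k approximation A_k)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    A_k = (U_k * s_k) @ Vt_k                 # best rank-k approximation of A
    doc_vectors = (s_k[:, None] * Vt_k).T    # one row per document
    return doc_vectors, A_k


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Rows are terms, columns are documents (hypothetical counts).
A = np.array([
    [1.0, 0.0, 0.0],   # car
    [0.0, 1.0, 0.0],   # automobile
    [1.0, 1.0, 0.0],   # engine
    [0.0, 0.0, 2.0],   # flower
])
doc_vectors, A_k = lsi(A, k=2)
raw_sim = cosine(A[:, 0], A[:, 1])                  # documents 0 and 1, term space
lsi_sim = cosine(doc_vectors[0], doc_vectors[1])    # same documents, LSI space
```

In the original term space, documents 0 and 1 overlap only through "engine"; after reduction to two factors, "car" and "automobile" load on the same factor because of their shared context, so the two documents become nearly identical.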

A lot of time was devoted to the detection of phrases. Since the best phrase candidates can be found in noun phrases, the programs LT POS and LT CHUNK have first been applied to detect all noun phrases in the complete document collection.

MDS represents all high-dimensional points (documents) in a two- or three-dimensional space in a way that the pairwise distances between points approximate the original high-dimensional distances as precisely as possible (see Mardia, Kent, & Bibby, 1979).
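As a rough illustration, classical (Torgerson) MDS can be written with an eigendecomposition of the double-centred squared-distance matrix. The excerpt does not say which MDS variant the authors used, so this is an assumption; the function name `classical_mds` and the toy points are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def classical_mds(D, dims=2):
    """Embed points in `dims` dimensions from a matrix of pairwise distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dims]   # keep the largest `dims`
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)


def pairwise_distances(X):
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)


# Four hypothetical documents; in practice D would come from document similarities.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = pairwise_distances(X)
Y = classical_mds(D, dims=2)
D_rec = pairwise_distances(Y)
```

Because the toy points already lie in a plane, the two-dimensional map reproduces their distances exactly (up to rotation and reflection); with real high-dimensional document vectors the distances are only approximated.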

The agglomerative hierarchical cluster algorithm using Ward’s method (see Jain & Dubes, 1988) was chosen to subdivide the documents into clusters. ... One of the disadvantages of agglomerative hierarchical clustering is that wrong choices (merges) that are made by the algorithm in an early stage can never be repaired (Kaufman & Rousseeuw, 1990). What we sometimes observe when using hierarchical clustering is the forming of one very big cluster and a few small very specific clusters.
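The greedy nature of the algorithm is easy to see in a naive sketch of Ward's method: at each step the two clusters whose merge increases the within-cluster sum of squares the least are joined, and a merge is never undone. The code below is an unoptimised illustration on made-up points (NumPy assumed; `ward_clusters` is a hypothetical name), not the authors' implementation.

```python
import numpy as np


def ward_clusters(X, n_clusters):
    """Agglomerate rows of X with Ward's criterion until n_clusters remain."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]    # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = X[clusters[a]].mean(axis=0)
                cb = X[clusters[b]].mean(axis=0)
                na, nb = len(clusters[a]), len(clusters[b])
                # Increase in within-cluster sum of squares caused by the merge.
                cost = na * nb / (na + nb) * np.sum((ca - cb) ** 2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # the merge is final
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical 2D points: two tight pairs and one isolated point.
X = [[0, 0], [0, 1], [10, 10], [10, 11], [20, 0]]
labels = ward_clusters(X, n_clusters=3)
```

For real collections, a library routine such as SciPy's hierarchical clustering with Ward linkage would normally be used instead of this cubic-time loop.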

The journal Scientometrics can be largely separated from the other journals (which is also confirmed by the different term profile in the table of Appendix 1), and exhibits two different foci (best visible in Fig. 4).



The first “leg”, indicated by the ellipse with number 1 and by and large containing the first focus of the journal Scientometrics, clearly contains papers in bibliometrics. The 10 best TF-IDF terms for “leg” #1 are: citat, cite, impact factor, self citat, co citat, scienc citat index, citat rate, isi, countri and bibliometr.

The second “leg of Scientometrics”, indicated by number 2, is characterised by the best terms patent, industri, biotechnolog, inventor, invent, compani, firm, thin film, brazilian and citat. The JIS paper (#3) embedded in this patent “leg” might be considered an outlier for that journal, but it was put in the right place since it is concerned with “The many applications of patent analysis” (Appendix 2: Breitzman & Mogee, 2002).

An important focus of LIS is indicated by ellipse #5 and can be profiled as “Information Retrieval” (IR) when looking at the highest scoring terms: queri, search engin, web, node, music, imag, xml, vector and weight.

The fourth distinguishable subpart of LIS (#12) is about digit, internet, servic, seek, behaviour, health, knowledg manag, organiz, social and respond; so encompassing the more social aspects.

The remaining large subpart is somewhat the central part (#14). It consists of papers leading to a mean profile containing the terms web, web site, classif, domain, web page, languag, scientist, region, catalog, and web impact factor.

The term network of Cluster 1 allowed the conclusion that the papers belonging to this cluster are concerned with domain studies, studies of collaboration in science, citation analyses, national research performance and similar issues.



The medoid is a paper by Persson et al. on “Inflationary bibliometric values: The role of scientific collaboration and the need for relative indicators in evaluative studies” (Appendix 2: Persson et al., 2004). This is a methodological paper with strong implications for research evaluation, combining research collaboration with citation analysis and construction of national science indicators.
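A medoid is the member of a cluster that is, in total, closest to all the other members, so unlike a centroid it is always an actual paper. A minimal sketch, assuming Euclidean distance on made-up coordinates (the excerpts do not state which distance the authors used to pick medoids, and the function name is hypothetical):

```python
import math


def medoid_index(points):
    """Index of the point with the smallest total distance to all points."""
    def total_distance(i):
        return sum(math.dist(points[i], q) for q in points)
    return min(range(len(points)), key=total_distance)


# Hypothetical cluster: four nearby papers and one outlier.
cluster = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
central = medoid_index(cluster)
```

Here the outlier at (10, 0) pulls the centroid to (3.2, 0), which is not a member of the cluster, while the medoid remains one of the papers.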

The smaller bibliometrics cluster (Cluster 3: manually labelled as “Bibliometrics2”) is of more methodological/theoretical nature.




The medoid is the state-of-the-art report “Journal impact measures in bibliometric research” (Appendix 2: Glänzel & Moed, 2002).

The term networks for the two bibliometrics clusters just described contain a few overlapping terms (bibliometr, chemistri, citat, citat rate, cite, cluster, countri, impact factor, isi, physic, rank and scienc citat index). The MDS plot of Fig. 15 confirms that there is no clear border between Bibliometrics1 and Bibliometrics2, but that there is a gradual transition.

The almost tiny Cluster 2 (19 papers, Fig. 10) represents patent analysis.


A paper on “Methods for using patents in cross-country comparisons” forms the medoid of this cluster (Appendix 2: Archambault, 2002).

Cluster 4, with 282 papers, is the largest one. We have labelled it “Information Retrieval”.


The medoid paper is entitled “Querying and ranking XML documents” (Appendix 2: Schlieder & Meuss, 2002).

Cluster 5, with 62 papers, belongs to the small clusters. Both terms and papers close to the medoid characterise this cluster as “Webometrics”.


The medoid paper is entitled “Motivations for academic web site interlinking: evidence for the Web as a novel source of information on informal scholarly communication” (Appendix 2: Wilkinson et al., 2003).

Cluster 6 (213 papers) proved to be the most heterogeneous cluster. We have labelled it “Social”, however, we could also have called it “General & miscellaneous issues”.



“Approaches to user-based studies in information seeking and retrieval: a Sheffield perspective” is the title of the medoid paper (Appendix 2: Beaulieu, 2003).


The Patent cluster can be clearly separated from the rest of LIS. The subspace under the line is almost completely occupied by Bibliometrics1, Bibliometrics2 and Patent.




IR and IPM almost collide in this 2D projection (Fig. 20). This means that Cluster 4 (“IR”) is very close to the scope of this journal.

The “Social” cluster with general and miscellaneous topics as well as “Webometrics” are close to JIS, JDoc and JASIST, too. Moreover, the “Social” cluster is almost equidistant to all traditional journals in Information Science.

The remaining three clusters, namely Bibliometrics1, Bibliometrics2 and Patent, form a triangle in the centre of which the journal Scientometrics is located. The relatively large distances among these clusters and between each cluster and the journal, strongly indicate that a quite large spectrum of bibliometric, technometric and informetric research using different vocabularies is covered by the journal Scientometrics. This observation is in line with the findings by Schoepflin and Glänzel (2001) that scientometrics consists of several subdisciplines such as informetric theory, empirical studies, indicator engineering, methodological studies, sociological approach and science policy; and that case studies and methodology became dominant by the late 1990s. At the end of the 1990s, also technology related studies based on patent statistics became an emerging subdiscipline of the field.

We have found two clusters in bibliometrics, of which a big one in applied bibliometrics/research evaluation and a smaller one in methodological/theoretical issues; also we have found two large clusters in information retrieval and general and miscellaneous issues and, finally, two small emerging clusters in webometrics and patent and technology studies. Within the IR cluster, we have found a small subcluster on music retrieval, which might be a temporary phenomenon since the journal JASIST has published a special issue on this topic.

According to the expectation, IR, General issues and Webometrics were represented by four of the five journals, namely JIS, IPM, JASIST and JDoc, while the two bibliometrics and the patent clusters were the domain of the journal Scientometrics.
