Friday, February 28, 2014

Chen, C. (2006). CiteSpace II: detecting and visualizing emerging trends and transient patterns in scientific literature. Journal of American Society for Information Science and Technology, 57(3), 359-377.

information visualization

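This study proposes a visualization interface that integrates the research fronts of a research specialty with the intellectual base they cite. The paper defines a research front as a group of concepts and research issues that emerge abruptly within a specialty; the intellectual base of a research front consists of the articles cited or co-cited by the articles that contain those concepts and research issues. To visualize the research fronts and intellectual base of a specialty, articles related to the specialty are first collected, terms representing the research fronts are extracted from them, and the articles they cite or co-cite are taken as the specialty's intellectual base. Bipartite networks of research-front terms and intellectual-base articles are then built, so that the concepts and research issues of the research fronts and the articles of the intellectual base are shown together. Clusters of terms and articles in the resulting network reveal the important research fronts and intellectual bases, and describing a cluster's concepts and research issues through terms conveys the meaning of a research front more effectively. When the publication time of the articles is added to the analysis, developing research fronts can be found from related terms that abruptly appear in many articles. In addition, betweenness centrality analysis of the network reveals articles that occupy pivotal positions between research fronts, and the Pathfinder algorithm reveals the main connections among articles.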
A specialty is conceptualized and visualized as a time-variant duality between two fundamental concepts in information science: research fronts and intellectual bases.
A research front is defined as an emergent and transient grouping of concepts and underlying research issues.
The intellectual base of a research front is its citation and co-citation footprint in scientific literature: an evolving network of scientific publications cited by research-front concepts.
The concept of a research front was originally introduced by Price (1965) to characterize the transient nature of a research field. Price observed what he called the immediacy factor: There seems to be a tendency for scientists to cite the most recently published articles. In a given field, a research front refers to the body of articles that scientists actively cite.
A specialty can be conceptualized as a time-variant mapping from its research front to its intellectual base.
Typical questions regarding a research front may include:
How did it get started? What is the state of the art? What are the critical paths in its evolution?
To address such questions, we need to detect and analyze emerging trends and abrupt changes associated with a research front over time. We also need to identify the focus of a research front at a particular time in the context of its intellectual base, to reveal significant intellectual turning points as a research front evolves, and to discover the interconnections between different research fronts.
Braam, Moed, and Raan (1991) defined a specialty as “focused attention by a number of scientific researchers to a set of related research problems and concepts” (p. 252). They studied the continuity and stability of a specialty in terms of the similarity between co-citation clusters across consecutive years. The similarity between two co-citation clusters is determined by comparing aggregated word profiles of the clusters.
In part, this is because we define a research front differently to emphasize emerging trends and abrupt changes as the defining features of a research front. A research front is the domain of a time-variant mapping, and its intellectual base is the co-domain of the mapping.
Griffith et al. (1974) found that between-cluster co-citation links tend to be weaker than within-cluster co-citation links. ... To understand how specialties and different thematic trends interact with each other, it is essential to study the nature of long-range, between-cluster links and understand why articles in different specialties were connected.
Labeling clusters is concerned with the clarity and interpretability of co-citation clusters. The standard approach relies on word profiles derived from articles citing a cluster of co-cited articles. ... Word-profile approaches have drawbacks. First, word profiles may not converge to a focused message. Analysts and users will make a substantial amount of sense-making efforts to synthesize a diverse range of word profiles. Second, cluster labels based on aggregating word profiles tend to be too broad to be useful. In practice, many users would be interested in not only the most commonly used terms but also terms that can lead to profound changes. Terms associated with an emerging trend could be overshadowed by a broader and more persistent theme.
In CiteSpace II, a current research front is identified based on such burst terms extracted from titles, abstracts, descriptors, and identifiers of bibliographic records. These terms are subsequently used as labels of clusters in heterogeneous networks of terms and articles.
CiteSpace II makes it easier for users to identify pivotal points. In addition to inspecting salient visual attributes, the user easily can see nodes with high betweenness centrality (Freeman, 1979).
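To make the idea of a pivotal point concrete, here is a minimal sketch of betweenness centrality on a small undirected network, using Brandes' algorithm in plain Python. The toy network and all names are hypothetical, and this is not CiteSpace II's own code; it only illustrates why a node that bridges two clusters scores highest.

```python
from collections import deque

def betweenness_centrality(adj):
    """Brandes' algorithm for an unweighted, undirected graph.

    adj maps each node to a list of its neighbours.
    Returns unnormalized betweenness (each unordered pair counted once).
    """
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Undirected graph: every pair was visited from both endpoints.
    return {v: c / 2.0 for v, c in score.items()}

# Hypothetical toy network: two triangles joined through node "p".
toy = {
    "a": ["b", "p"], "b": ["a", "p"],
    "p": ["a", "b", "c", "d"],
    "c": ["d", "p"], "d": ["c", "p"],
}
centrality = betweenness_centrality(toy)
pivot = max(centrality, key=centrality.get)
```

In this toy case every shortest path between the two triangles runs through "p", so it is the only node with non-zero betweenness.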
The procedure for using CiteSpace II consists of the following steps:
(1) Identify a knowledge domain using the broadest possible term.
(2) Data collection
(3) Extract research front terms: CiteSpace II first collects n-grams, or terms, from titles, abstracts, descriptors, and identifiers of citing articles in a dataset. The present study used single words or phrases of up to four words. ... Research-front terms are determined by the sharp growth rate of their frequencies.
(4) Time slicing
(5) Threshold selection
(6) Pruning and merging: Pathfinder network scaling is the default option in CiteSpace II for network pruning (Chen, 2004; Schvaneveldt, 1990).
(7) Layout
(8) Visual inspection
(9) Verify pivotal points
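Step (3) can be illustrated with a deliberately simplified sketch: collect n-grams of up to four words and flag those whose yearly count jumps sharply. The year-over-year ratio test, the thresholds `min_ratio` and `min_count`, and the sample records are all assumptions made for illustration; the excerpt only says that terms are selected by the sharp growth rate of their frequencies and does not give the detection rule.

```python
from collections import Counter

def extract_ngrams(text, max_n=4):
    """Return all word n-grams of length 1..max_n from a text."""
    words = text.lower().split()
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

def surging_terms(records, min_ratio=2.0, min_count=2):
    """Flag terms whose frequency jumps from one recorded year to the next.

    records: list of (year, text) pairs.
    Returns {term: first year in which its count is at least min_count and
    at least min_ratio times the previous year's count (add-one smoothed)}.
    """
    per_year = {}
    for year, text in records:
        # Count each term at most once per record.
        per_year.setdefault(year, Counter()).update(set(extract_ngrams(text)))
    flagged = {}
    years = sorted(per_year)
    for prev, cur in zip(years, years[1:]):
        for term, count in per_year[cur].items():
            before = per_year[prev][term]
            if (term not in flagged and count >= min_count
                    and count / (before + 1) >= min_ratio):
                flagged[term] = cur
    return flagged

# Hypothetical titles, for illustration only.
records = [
    (2000, "impact crater evidence"),
    (2000, "fossil record"),
    (2001, "fossil record gap"),
    (2001, "impact crater dating"),
    (2002, "iridium anomaly at the boundary"),
    (2002, "iridium anomaly and impact crater"),
    (2002, "new iridium anomaly data"),
]
flagged = surging_terms(records)
```

Here only "iridium", "anomaly", and "iridium anomaly" are flagged: they go from zero to three records in 2002, while persistent terms such as "impact crater" stay flat.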
We demonstrate the new features of CiteSpace with case studies of two research fields: mass-extinction research (1981–2003) and terrorism research (1990–2003).
Mass-extinction research (1981–2003).
The input data for CiteSpace II were retrieved from citation index databases via the Web of Science based on a topic search for articles published between 1981 and 2003 on mass extinction. The scope of the search included four topic fields in each bibliographic record: title, abstract, descriptors, and identifiers. The search was limited to articles in English only.
The resultant dataset contains a total of 771 records.
A total of 333 research-front terms were detected from the four topic fields of these records.
Terrorism research (1990–2003).
The terrorism research (1990–2003) dataset consists of 1,776 records resulting from a topic search on terrorism in the Web of Science.
A total of 1,108 research-front terms were found.
The fully integrated representation of research fronts and intellectual bases in the same network visualization has three practical advantages.
First, using surged topical terms rather than the most frequently occurring title words is particularly suitable for detecting emerging trends and abrupt changes. In visualized networks, research-front terms are explicitly linked to intellectual-base articles. This design presents a compact representation of the duality between a research front and its intellectual base.
Second, research-front terms naturally lend themselves to be used as labels of specialties.
Third, it overcomes a common drawback of word-profile-based labeling approaches. Aggregated word profiles may not converge to an intrinsic focus. Terms selected based on sudden increased popularity measures are particularly suitable to characterize a current research front.
The Pathfinder algorithm extracts the most salient patterns from a network, but it does not scale well. CiteSpace II implements a concurrent version of the algorithm. The concurrent Pathfinder algorithm has substantially optimized the network scaling module, although it still took 6,000 seconds to process 14 networks and merge them into a 1,704-node network.
In conclusion, the new features introduced to CiteSpace II for detecting and visualizing emerging trends and abrupt changes in a field of research have produced promising and encouraging results. The major findings are that
• the surge of interest is an informative indicator for a new research front;
• using heterogeneous networks of terms and articles provides a comprehensive representation of the dynamics of a specialty;
• research-front terms are informative cluster labels;
• citation tree-ring visualizations are visually appealing and semantically interpretable;
• betweenness centrality metrics identify semantically valid pivotal points.

Thursday, February 27, 2014

Chen, C. and Morris, S. (2003). Visualizing evolving networks: minimum spanning trees versus pathfinder networks. In IEEE Symposium on Information Visualization 2003, Oct. 19-21, 2003, Seattle, Washington, USA, 67-74.

information visualization

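This study compares two link-reduction methods, minimum spanning trees (MST) and Pathfinder networks (PFNET), as applied to co-citation networks, in terms of the topological and dynamical properties of the resulting networks. The purpose of link reduction is to make a co-citation network show its important links (co-citation relationships between articles or authors) and nodes (articles or authors) more clearly. The reduced network therefore has to preserve the topology of the original network and, when it is displayed in the chronological order of citation, its changes should reveal how a specialty evolves. The study finds that an MST-reduced co-citation network keeps the links around high-degree nodes and so maintains the structure of the original network, but because some important links on certain paths are removed, it cannot adequately express the evolution of a specialty. The network produced by PFNET, in contrast, not only corresponds to the topics of the specialty but also, when displayed by the time of co-citation, reveals developments within and between topics.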
We compare the visualizations of co-citation networks of scientific publications derived by two widely known link reduction algorithms, namely minimum spanning trees (MSTs) and Pathfinder networks (PFNETs).
Two criteria are derived for assessing visualizations of evolving networks in terms of topological properties and dynamical properties.
The results suggest that although high-degree nodes dominate the structure of MST models, such structures can be inadequate in depicting the essence of how the network evolves because MST removes potentially significant links from high-order shortest paths. In contrast, PFNET models clearly demonstrate their superiority in maintaining the cohesiveness of some of the most pivotal paths, which in turn make the growth animation more predictable and interpretable.
The shortage of comprehensive examinations of the evolution of citation networks is due to various reasons, including the lack of an overarching framework that accommodates underlying theories and system functionalities across relevant disciplines, the lack of integrated network analysis and visualization tools, the lack of widely accessible longitudinal citation network data, and the lack of tools that specifically facilitate the analysis of network evolution.
A common problem with visualizing a complex network is that a large number of links may prevent users from recognizing salient structural patterns.
In fact, an MST is a special case of a Pathfinder network because a Pathfinder network is the set union of all the possible MSTs derived from a network [Schvaneveldt 1990].
In order to achieve a network of high clarity and legibility, it is necessary to impose the so-called triangular inequality throughout the network. While this requirement leads to the simplest representation of the essence of an underlying proximity network, this is at a considerable computational cost. Additionally, as the size of the original network increases, the algorithm requires a considerable amount of memory to run.
The most widely known graph drawing techniques include force-directed graph drawing algorithms and spring-embedder algorithms [Eades 1984]. ... These algorithms, however, face some challenges in terms of efficiency, especially in terms of scalability, which is closely related to the clarity of a visualized network.
A commonly used strategy to reduce clutter is to reduce the number of links. There are several ways to achieve this goal. Three popular ones are analyzed below.
The first option is to impose a link weight threshold and include only links with weights above the threshold [Zizi and Beaudouin-Lafon 1994]. ... However, it does not take the intrinsic structure of the underlying network into account, so the transformed network may not preserve the essence of the original network.
The second option is extracting a minimum spanning tree (MST) from a network of N vertices and reducing the number of links to N – 1.
The third option is imposing constraints on paths and excluding links that do not satisfy the constraints, for instance, as in Pathfinder network scaling [Schvaneveldt 1990]. ... The topology of a PFNET is determined by two parameters q and r and the corresponding network is denoted as PFNET(r, q). The q-parameter specifies the maximum length of a path subject to the triangular inequality test. The r-parameter is the Minkowski metric used to compute the distance of a path. The most concise PFNET for visualization is PFNET (q = N–1, r = inf) [Chen 2002; Chen and Paul 2001; Schvaneveldt 1990].
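The third option can be sketched for the most concise case, PFNET(q = N–1, r = inf). With r = inf the length of a path is its largest link weight, so a link is kept only if no alternative path has a smaller maximum weight. The sketch below assumes link weights are distances (smaller means closer), so co-citation counts would first have to be converted to dissimilarities. The function name and the toy network are hypothetical, and the cubic-time loop is a plain illustration rather than an optimized implementation.

```python
import math

def pathfinder_minimal(nodes, dist):
    """PFNET(r = inf, q = N - 1) for an undirected network of distances.

    dist maps frozenset({u, v}) to a positive distance (no self-loops).
    Returns the set of links that satisfy the triangle inequality.
    """
    d = {(u, v): (0.0 if u == v else dist.get(frozenset((u, v)), math.inf))
         for u in nodes for v in nodes}
    # Floyd-Warshall variant: minimise the largest link along any path.
    for k in nodes:
        for i in nodes:
            for j in nodes:
                via_k = max(d[i, k], d[k, j])
                if via_k < d[i, j]:
                    d[i, j] = via_k
    # Keep a link only if it is no longer than the best alternative path.
    return {pair for pair, w in dist.items() if w <= d[tuple(pair)]}

# Hypothetical four-node network.
nodes = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 1.0,
    frozenset(("B", "C")): 2.0,
    frozenset(("A", "C")): 3.0,
    frozenset(("C", "D")): 1.0,
    frozenset(("A", "D")): 5.0,
}
kept = pathfinder_minimal(nodes, dist)
```

The links A-C and A-D are dropped because the path A-B-C-D never uses a link heavier than 2, leaving the chain A-B, B-C, C-D.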
Most network growth models draw upon the rich-get-richer notion and cumulative advantage. As a result, if the degree of a node indicates its “richness,” a node with a higher degree will have a better chance to receive the next new link than a node with lower degree. In a citation network, this means that a highly cited article is more likely to be cited again than a less frequently cited article. This type of growing mechanism is known as preferential attachment.
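Preferential attachment can be illustrated with a short simulation: each new node links to one existing node chosen with probability proportional to that node's current degree. This is a generic sketch of the mechanism, not a model taken from the paper; the function name, the single link per new node, and the seed are assumptions.

```python
import random

def grow_by_preferential_attachment(n_nodes, seed=0):
    """Grow a tree in which each new node links to one existing node,
    chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Each node appears in this list once per link it holds, so a uniform
    # draw from the list is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return edges, degree

edges, degree = grow_by_preferential_attachment(200)
richest = max(degree, key=degree.get)
```

Sorting `degree` by value after a run typically shows a few early nodes holding a large share of the links, which is the rich-get-richer effect described above.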
It appears to be particularly problematic to identify significant topological and dynamical patterns in such visualization models because of the high density of the underlying network.
An et al. [2001] suggested that the evolution of citation networks could be useful in predicting research trends and in studying a scientific community’s life span.
Two criteria are derived based on the above analysis for qualitatively evaluating network visualization.
The first criterion for selecting a preferable topological structure of a visualized network is the presence of hubs, or stars, in derived networks. ... A star pattern indicates that the star node carries the most information, possesses the highest cue validity, and is the most differentiated from the others. ... Existing studies appear to suggest that co-citation counts are likely to form such star patterns in both MST and PFNET.
Criterion II requires that the changes of topological properties over time must preserve the integrity of emergent trends or patterns. Visualizing network evolution should not merely inform users of changes of individual nodes and links; rather, it is essential to inform users how an intrinsically cohesive structure changes, locally and globally, in an organic way.
In this study, research fronts were identified by agglomerative clustering using only papers that had at least five bibliographic coupling counts with some other paper in the dataset. Similarity calculation was based on Salton's cosine coefficient [Salton 1989] applied to bibliographic coupling counts. The titles for each research front were derived manually by exploring titles of papers within each research front for common themes.
Base reference clusters were formed by agglomerative clustering using only references that had been cited 10 or more times. Similarity calculation was based on Salton's cosine coefficient applied to co-citation counts. For each base reference cluster, labels were found by using the label of the research front that contained the most citations to references in the cluster.
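The similarity measure used in both clustering steps can be sketched as follows for bibliographic coupling: the number of shared references divided by the square root of the product of the two reference-list sizes. The function name and the sample reference lists are hypothetical, and the clustering step itself is not shown.

```python
import math

def coupling_cosine(refs_a, refs_b):
    """Salton's cosine for two papers given their reference lists:
    shared references / sqrt(len(A) * len(B))."""
    a, b = set(refs_a), set(refs_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Hypothetical papers: 4 and 9 references, 3 of them shared.
paper_1 = ["r1", "r2", "r3", "r4"]
paper_2 = ["r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10"]
similarity = coupling_cosine(paper_1, paper_2)
```

The same formula applies to co-citation counts when the sets hold the papers that cite each reference rather than the references each paper cites.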
A map of the references in the pathfinder network was produced identifying each reference by its base reference cluster membership, which allowed labeling of sections of the pathfinder network based on base cluster labels.
In general, due to the arbitrary choice inherited from the MST algorithms, one cannot guarantee the uniqueness of an MST. As a result, an MST may not preserve all the necessary links for representing the growth of a co-citation network. If this is the case, then important diffusion patterns may be distorted or inadequately represented by the extracted MST model.
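The non-uniqueness is easy to reproduce. In the sketch below, Kruskal's algorithm is run twice on a hypothetical triangle with two tied links; only the order in which ties are considered differs, yet the two trees differ. The tie-breaking rule shown is just one possible source of arbitrariness, not the specific MST implementation used in the study.

```python
def kruskal(nodes, weighted_links):
    """Kruskal's algorithm; ties are broken by the order of weighted_links."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = set()
    # sorted() is stable, so equally weighted links keep their input order.
    for u, v, w in sorted(weighted_links, key=lambda link: link[2]):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.add(frozenset((u, v)))
    return tree

nodes = ["A", "B", "C"]
links = [("A", "B", 1), ("B", "C", 2), ("A", "C", 2)]
mst_1 = kruskal(nodes, links)                  # keeps A-B and B-C
mst_2 = kruskal(nodes, list(reversed(links)))  # keeps A-B and A-C
union_of_msts = mst_1 | mst_2
```

The union of the two trees contains all three links, which matches the statement quoted earlier that a Pathfinder network is the set union of all possible MSTs.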
The 516-node PFNET (q = N – 1, r = inf) is shown in Figure 3. The two parameters q and r were chosen to ensure that the extracted PFNET has the least number of links.
The animated PFNET visualization model demonstrated that nodes with similar colors often emerged simultaneously and formed local structures. And these local structures were reinforced by the timely emergence of salient co-citation links. The growth process can be represented by the dynamics shown in such local structures. Features such as continuity, predictability, and local cohesiveness in the PFNET indicated that the second criterion was met.

Chen, C. (1997). Structuring and visualizing the WWW with Generalized Similarity Analysis. Proceedings of the 8th ACM Conference on Hypertext (Hypertext '97), 177-186.

Retrieved August 27, 2012, from http://delivery.acm.org/10.1145/270000/267456/p177-chen.pdf?ip=211.76.242.1&acc=ACTIVE%20SERVICE&CFID=108526370&CFTOKEN=44188441&__acm__=1346053732_db4b154c3eaabf4b1429f52a8e30ba0c
vis_paper

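This paper uses the Pathfinder method to provide a visual presentation of web pages (or websites), measuring their proximity according to the hypertext linkage, content similarity, and browsing patterns among them. Specifically, the study estimates proximity using the ratio of the number of links between pages, the vector space model, and the state transition probabilities between pages. Web pages serve as the vertices of a graph and the estimated proximities as the strengths of the links between vertices; the Pathfinder method then removes unnecessary links while preserving the main topology of the network. The author argues that Pathfinder represents local relationships in the graph more accurately than multidimensional scaling (MDS).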
This paper describes a generic approach to structuring and visualizing a hypertext-based information space on the WWW. This approach, called Generalised Similarity Analysis (GSA), provides a unifying framework for extracting structural patterns from a range of proximity data concerning three fundamental relationships in hypertext, namely, hypertext linkage, content similarity and browsing patterns.
Pathfinder networks are used as a natural vehicle for structuring and visualizing the rich structure of an information space by highlighting salient relationships in proximity data.
Georgia Institute of Technology’s WWW User Surveys [17] shows that 69.1% of users regarded the delay in downloading Web pages as a major problem and 34.5% of users identified the difficulty of finding an existing page. In particular, 14.3% of the users reported the difficulty of visualizing where they have been and where they can go and 6.5% identified the classic hypertext problem: lost in hyperspace. The memory overload remains a problem when navigating the WWW.
Ideally, spatial relationships in visualization should be determined by some psychological judgments of proximity, such as similarity, dissimilarity and relatedness.
Pirolli, Pitkow and Rao’s study [18] and HyPursuit [20] are two notable examples of taking into account hypertext linkage, content similarity and usage information on the WWW.
In HyPursuit, document similarity by linkage is defined as a linear combination of three components: direct linkage, ancestor and descendant inheritance.
Pirolli, Pitkow and Rao [18] developed a model which characterises documents on the WWW by various attributes associated with these documents, such as the number of incoming and outgoing hyperlinks of a document, how frequently the document was downloaded from the hosting WWW server and content similarities between the document and its children.
Sequential patterns of browsing indicate, to some extent, document relatedness perceived by users. For example, the number of users who followed a hyperlink connecting two documents in the past were used in [18] to indicate the degree of relatedness between the two documents.
Furnas’ fisheye views model is based on a “degree of interest” (DOI) function which assigns a value to each node in accordance with the degree to which a user would be interested in seeing that node [14, 12]. ... A fisheye view can be generated with a threshold so that only nodes with sufficient DOI are displayed in the view. ... By choosing a different DOI function, one can produce a fisheye view which emphasizes a particular type of structural patterns [12]. For example, the number of times that a node has been visited can be used to define a user-centred fisheye view, in which popular nodes will be highlighted for easy access.
In this paper, we focus on extracting underlying relationships in a hypertext information space and representing resultant patterns for structuring and visualizing the information space. Existing techniques such as fisheye views can be subsequently incorporated into such systems with improved spatial configuration mechanisms.
This definition also takes into account the overall connectivity of the document Di, which can be related to the ROC metric defined in [2].
In this study, we use the well-known tf x idf model, term frequency times inverse document frequency, to build term vectors. ... The document similarity is computed as follows based on corresponding vectors.
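The excerpt omits the similarity formula itself, so the following is only a sketch of a common choice: tf × idf weights with idf = log(N / document frequency), compared by the cosine of the angle between vectors. Both the idf variant and the use of cosine are assumptions here, and the documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: {doc_id: list of terms}. Returns {doc_id: {term: tf * idf}}
    with idf = log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    vectors = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)
        vectors[doc_id] = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical documents, already tokenized.
docs = {
    "d1": ["hypertext", "link", "link"],
    "d2": ["hypertext", "link", "browse"],
    "d3": ["vrml", "scene"],
}
vectors = tfidf_vectors(docs)
sim_12 = cosine(vectors["d1"], vectors["d2"])
sim_13 = cosine(vectors["d1"], vectors["d3"])
```

Documents d1 and d2 share terms and get a similarity strictly between 0 and 1, while d1 and d3 share nothing and score 0.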
We have applied a state transition approach to extracting behavioral patterns of users with a hypertext system [6]. The dynamics of a browsing process can be captured by state transition probabilities. Transition probabilities can be used to indicate document similarity in the nature of browsing.
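As an illustration of the state transition idea, the sketch below estimates first-order transition probabilities from logged browsing sessions by counting how often one page is followed by another. The session data and the function name are hypothetical, and the paper's own estimation procedure is described in [6], not reproduced here.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate first-order transition probabilities from browsing sessions.

    sessions: list of page sequences.
    Returns {page: {next_page: probability}}.
    """
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return {
        page: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for page, nexts in counts.items()
    }

# Hypothetical sessions.
sessions = [
    ["home", "papers", "gsa"],
    ["home", "papers", "vrml"],
    ["home", "people"],
    ["home", "papers", "gsa"],
]
probs = transition_probabilities(sessions)
```

A high probability such as `probs["home"]["papers"]` can then serve as the proximity between the two pages before Pathfinder scaling is applied.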
Pathfinder provides a more accurate representation of local relationships than techniques such as multidimensional scaling (MDS)[10]. Pathfinder has been applied to a number of human-computer interaction problems [10].
The topology of a PFNET is determined by two parameters q and r and the corresponding network is denoted as PFNET(r,q). The q-parameter constrains the scope of minimum-cost paths to be considered. The r-parameter defines the Minkowski metric used for computing the distance of a path.
When a PFNET satisfies the following 3 conditions, the distance of a path is the same as the weight of the path:
1. The distance from a document to itself is zero.
2. The proximity matrix for the documents is symmetric; thus the distance is independent of direction.
3. The triangle inequality is satisfied for all paths with up to q links. If q is set to the total number of nodes less one, then the triangle inequality is universally satisfied over the entire network.
The number of links in a network can be reduced by increasing the value of parameter r or q. The distance between nodes in a network is the length of the minimum-length path connecting the nodes; such a path is known as the geodesic connecting the nodes. A minimum-cost network (MCN), PFNET(r=INF, q=n-1), has the least number of links.
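The effect of the r-parameter can be seen in a small sketch of the Minkowski path distance, (sum of w^r)^(1/r), which reduces to the largest link weight when r is infinite. The function names and the sample weights are hypothetical.

```python
import math

def path_distance(link_weights, r):
    """Minkowski distance of a path with the given link weights.

    r = 1 sums the weights, r = 2 is Euclidean, and r = math.inf
    reduces to the largest single weight on the path.
    """
    if math.isinf(r):
        return max(link_weights)
    return sum(w ** r for w in link_weights) ** (1.0 / r)

def violates_triangle_inequality(direct, alternative_path, r):
    """True if a direct link is longer than an alternative path."""
    return direct > path_distance(alternative_path, r)

# A direct link of weight 4 versus a two-link detour of weights 3 and 3.
kept_at_r1 = not violates_triangle_inequality(4, [3, 3], 1)
kept_at_inf = not violates_triangle_inequality(4, [3, 3], math.inf)
```

At r = 1 the detour costs 6, so the direct link of weight 4 survives; at r = inf the detour costs only 3, so the direct link is removed. This is why raising r thins the network.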
The major advantage of Pathfinder networks is that salient relationships among documents are extracted by patterns associated with minimum-cost paths. This type of information filtering improves the clarity and quality of the information produced by information visualization systems based on spring models. Users are able to see how documents are related to each other.
GSA has some distinct features. 
(1) GSA emphasizes that users can substantially benefit from explicit, graphical representations of salient relationships in hypertext systems, and these graphical representations should be incorporated into user interfaces so as to reduce cognitive burdens on users in browsing.
(2) Each component model in GSA can be used independently for extracting structures of a particular type so that users may contrast patterns in distinct characteristics. In contrast, related work such as [18] combines various features into a monolithic feature vector. Consequently, the resulting inter-document relationship is a combined effect of a range of factors. Users may not be able to assess how documents are related along a specific dimension.
(3) GSA focuses on relationships that are particularly essential for hypertext systems and these relationships are preserved in resulting network representations. Many existing information visualization techniques are based on storage information such as file-size and last modification time, and often use hierarchical structures as the basis of visualization. Differences between the two approaches should be evaluated by further empirical studies.
For example, a Pathfinder network becomes increasingly cluttered as the number of documents in the underlying information space increases. There are several possible ways to deal with this issue. One is to use existing display techniques such as fisheye views, which provide adequate access to specific local information as well as contextual structure.
Similar documents are naturally placed near to each other in the space. Users can gain a birds-eye view of the global structure by moving up to a higher view point in the sky and have a close look by moving down to a view point closer to the target document.

Chen, C. M., & Paul, R. J. (2001). Visualizing a knowledge domain's intellectual structure. Computer, 34(3), 65-71.

vis_paper
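This paper conducts an author co-citation analysis (ACA) based on articles published in IEEE Computer Graphics and Applications. It selects the 353 authors cited more than five times and builds a network from the co-citation information among them, which yields 28,638 links. Pathfinder network scaling then retains the 355 most important links. To discover the specialties of computer graphics and applications, the paper follows White and McCain (1998) and applies principal component analysis (PCA) to the co-citation data as a factor analysis. This produces 60 specialties, of which the five largest together explain 39% of the variance: Rendering and ray tracing, Computer vision, Geometric modeling and computer-aided design, Volume rendering, and Modeling nature. The authors assigned to these five specialties are also marked on the network so that their distribution across it can be observed.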
ACA, a special type of citation analysis, focuses on intellectual connections between authors as reflected through the scientific literature. The author co-citation relationship links two authors by how often other authors reference their work together. Author co-citation patterns provide the basis for constructing an alternative view to a knowledge structure.
Pathfinder uses a filtering criterion known as the triangle inequality condition to determine whether to remove or retain each link in the original network. Triangle inequality requires that the length of a path connecting two points in the network should not be longer than the length of other alternative paths connecting the two points, but go through extra intermediate points.
We began by studying author co-citation patterns found in IEEE Computer Graphics and Applications magazine for a period of 18 years. ... Among them, we entered into the author co-citation analysis only the 353 authors who received more than five citations in CG&A. Although this snapshot derives from a limited viewpoint (the literature of computer graphics certainly stretches beyond the scope of CG&A), intellectual groupings of these 353 authors provide the basis for visualizing the computer graphics knowledge domain. ... The original author co-citation network contains as many as 28,638 links, which constitutes 46 percent of all possible links, excluding self-citations. Because this many links would clutter visualizations, we applied Pathfinder network scaling to reduce their number to 355.
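The 46 percent figure can be checked with a few lines of arithmetic, assuming "all possible links" means all unordered pairs of the 353 authors. The variable names are ad hoc.

```python
n_authors = 353
observed_links = 28_638
retained_links = 355

# Unordered author pairs, i.e. links excluding self-citations.
possible_links = n_authors * (n_authors - 1) // 2
density = observed_links / possible_links

# A spanning tree over the same authors would need n - 1 links.
spanning_tree_links = n_authors - 1
extra_links = retained_links - spanning_tree_links
```

There are 62,128 possible pairs, so 28,638 links is about 46 percent of them. The 355 links kept by Pathfinder are only three more than the 352 a spanning tree would need.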
We enhanced the network by coloring it according to the results generated using principal component analysis (PCA). PCA identified 60 specialties in computer graphics. The largest (rendering and ray tracing) and second-largest (computer vision) accounted for 13 percent and 11 percent of the variance, respectively. The five largest specialties accounted for 39 percent of the variance. Remaining specialties are relatively small.
Factor 1: Rendering and ray tracing.
Factor 2: Computer vision.
Factor 3: Geometric modeling and computer-aided design.
Factor 4: Volume rendering.
Factor 5: Modeling nature.
The knowledge landscape visualizes intellectual structures. A virtual landscape like this provides an intuitive gateway for users to access the scientific literature. Researchers new to a field can gain a useful overview by using the knowledge landscape to establish their own mental model of the field and track the development of their own domain.

Chen, C., & Carr, L. (1999). Trailblazing the literature of hypertext: Author co-citation analysis (1989-1998). Proceedings of the 10th ACM Conference on Hypertext (Hypertext '99), 51-60.

vis_paper
Using the papers of nine ACM Hypertext conferences (1987-1998) as data, this paper applies author co-citation analysis (ACA), Pearson's correlation coefficients, and factor analysis to explore the research specialties of the hypertext field, and uses Pathfinder network scaling to visualize the results of the specialty analysis. The study analyzes co-citation among 367 frequently cited authors and extracts 39 factors, which together explain 87.8% of the variance; the first four factors alone explain 52.1%. Named after the authors in each factor, the top four specialties of the hypertext field are Classics, Information retrieval, Graphical user interfaces and information visualisation, and Links and linking mechanisms.
The ultimate goal of our work is to realise the vision of making the best use of an interrelated information space and building one’s own threads of association. As one step in this direction, we explore a new paradigm of structuring and visualising a domain-specific information space.
In this study, we choose the field of hypertext as the subject domain and map the literature of hypertext based on the ACM Hypertext conference proceedings (1987-1998).
The idea of mapping the tracks of science is explained by Garfield in [8]. The aim of such work is to identify research front specialties in a field of study. A specialty is characterised by its influence on the development of a given field. One can tell a specialty by the number of citations that it receives.
In 1981, the Institute for Scientific Information (ISI) published the ISI Atlas of Science in biochemistry and molecular biology [10]. The Atlas was constructed based on a co-citation index associated with publications in the field over a limited period of one year. 102 distinct clusters of articles were identified, which were called research front specialties, in order to give researchers a snapshot of significant research activities in biochemistry and molecular biology.
White and McCain [17] used author co-citation analysis to map the field of information science. ... Their study also included a factor analysis, in which major specialties were identified. One of the most remarkable findings is that the field of information science consists of two major specialties with little overlap between their memberships: experimental retrieval and citation analysis.
In a series of studies, we have been investigating the role of Pathfinder network scaling techniques in reducing the excessive number of links and extracting the most salient structures from a range of proximity data [3]. One problem we repeatedly encountered is an interpretation problem: users found it hard to make sense of the nature of the links selected by Pathfinder. ... A simple and easy-to-understand method is needed to explain the structure of a Pathfinder network, especially when the nodes are high dimensional in nature.
Following [17], the raw co-citation counts were transformed into Pearson’s correlation coefficients using the factor analysis. These correlation coefficients were used to measure the proximity between authors’ co-citation profiles. ... In the factor analysis, principal component analysis with varimax rotation was used to extract factors. The default criterion, eigenvalues greater than one, was specified to determine the number of factors extracted. ... Pearson correlation matrices were submitted to the GSA environment for processing, especially including Pathfinder network scaling and VRML-scene modelling.
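The transformation from raw co-citation counts to Pearson's correlation coefficients can be sketched as below. Each author's profile is that author's row of counts. Leaving out the two diagonal cells of the pair being compared is an assumption of this sketch, since the excerpt does not say how the diagonal was handled, and the 4 x 4 matrix is made up.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(cocitation):
    """cocitation: square list of lists of raw co-citation counts.

    For authors i and j, columns i and j are skipped so that the
    undefined diagonal cells do not enter the comparison.
    """
    size = len(cocitation)
    result = [[1.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            keep = [k for k in range(size) if k not in (i, j)]
            r = pearson([cocitation[i][k] for k in keep],
                        [cocitation[j][k] for k in keep])
            result[i][j] = result[j][i] = r
    return result

# Hypothetical counts for four authors.
counts = [
    [0, 8, 1, 2],
    [8, 0, 2, 4],
    [1, 2, 0, 9],
    [2, 4, 9, 0],
]
corr = correlation_matrix(counts)
```

Authors 0 and 1 are co-cited with the remaining authors in the same proportions, so their profiles correlate positively; authors 0 and 2 have opposite profiles and correlate negatively. A real analysis would use many more authors per profile than this toy case.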
Thirty-nine factors were extracted from the 367 x 367 author co-citation data set. These factors explain 87.8% of the variance. In particular, the top four factors alone explain 52.1% of the variance.
Factor 1: Classics.
Factor 2: Information retrieval.
Factor 3: Graphical user interfaces and information visualisation.
Factor 4: Links and linking mechanisms.
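The "eigenvalues greater than one" criterion and the "percentage of variance explained" figures quoted above can be illustrated with a small sketch. It assumes the eigenvalues of the correlation matrix are already available; the values below are hypothetical and are not taken from the paper, and `kaiser_selection` is an illustrative name.

```python
def kaiser_selection(eigenvalues):
    """Keep factors whose eigenvalue exceeds 1 and report the variance they explain.

    For a correlation matrix the eigenvalues sum to the number of variables,
    so eigenvalue / total is the share of variance carried by one factor.
    """
    total = sum(eigenvalues)
    retained = [ev for ev in sorted(eigenvalues, reverse=True) if ev > 1.0]
    return retained, sum(retained) / total


# Hypothetical eigenvalues of a 6 x 6 correlation matrix (they sum to 6).
eigenvalues = [2.7, 1.5, 0.9, 0.5, 0.3, 0.1]
retained, share = kaiser_selection(eigenvalues)
print(len(retained), f"{100 * share:.1f}% of variance")
```

In this toy case two factors pass the threshold. The same arithmetic, applied to the 367 x 367 matrix, is what lies behind statements such as "the top four factors alone explain 52.1% of the variance".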
Pathfinder networks can provide more accurate information about local structures than multidimensional scaling maps [13]. We found that the provision of explicit links in our maps made it easier to interpret interrelationships among different data points.
Furthermore, author co-citation maps provide a means of identifying research fronts, i.e. specialties in the field, and a visual aid of interpreting the results of factor analysis.

Saturday, February 15, 2014

Wouters, P., & Leydesdorff, L. (1994). Has Price's dream come true: Is scientometrics a hard science?. Scientometrics, 31(2), 193-222.

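This study uses articles published in the journal Scientometrics, together with their references, to judge from several kinds of evidence whether scientometrics has become a hard science, and to analyze other characteristics of the field. The scientometric information used includes the relative age of the cited literature, the relations among the authors of the articles, and the patterns of words in the titles of these articles.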

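According to Price's theory of knowledge growth, scientists cite the literature of their own field, so if research fronts exist in a field, they produce an immediacy effect. The Price Index measures this immediacy effect: a larger Price Index means that the cited literature is relatively younger. For example, Price (1970) found Price Index values of about 60% to 70% for biochemistry and physics, and around 42% for the social sciences. Crane (1972) argued that science forms communities whose authors are tightly connected to one another. This study therefore analyzes the co-authorship and citation relations among authors, uses network analysis to examine the cohesiveness of the scientific community, and measures both the connectedness of authors' positions in the network and their structural similarity. On this basis authors are clustered, either into clusters of strongly connected authors or into clusters of authors in similar positions. In addition, Rip and Courtial (1984) and Leydesdorff (1989a) point out that the words in a title can be regarded as indicators of the cognitive message of a publication, and that the co-occurrence of words in titles can be regarded as a record of relations between those words; this study therefore also applies network analysis to the co-occurrence network of title words.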

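The results show that, for the 779 Scientometrics items analyzed, the number published per year grew by an average of 3.5 items per year after rapid growth in the first three years. These items contain 12,341 references in total, an average of 15.8 references per item. All indicators are quite stable. The mean Price Index is 43.0%, and the yearly values range from 34.0% to 51.4%.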

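In terms of authorship, the 779 items were written by 669 different authors. Nearly three-fourths of the authors (488) appear on only one item, and each author appears on 1.8 items on average; author productivity follows a Lotka distribution. Most items (61%) have a single author, and each item has 1.6 authors on average. Among co-authored items, most authors collaborate with only one or two colleagues, so the co-authorship network is highly fragmented; the few larger networks are associated with authors who work at the same institution, take part in the same research program, or share research interests.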

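The author citation network of Scientometrics, by contrast, is highly cohesive. Of the 779 items, 411 are cited by other Scientometrics articles, and on average 19.4% of the references in a Scientometrics article are to Scientometrics itself. There are 181 authors who published more than one article, of whom 130 cite one another. The cliques formed by these mutual citation relations mostly correspond to the authors' institutional affiliations, and one clique arises from a debate among its members. Finally, the co-occurrence network of title words is also highly cohesive. From this evidence, scientometrics can be judged to have integrated its various disciplinary backgrounds both cognitively and socially, but no sign of a research front was found.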

In more than one respect, Scientometrics displays the characteristics of a social science journal. Its Price Index amounts to 43.0 percent, and is remarkably stable over time.

The majority of the published items in Scientometrics has been written by a single author. Moreover, the network of co-authorships is highly fragmented: most authors cooperate with no more than one or two colleagues.

Both the citation networks of the authors and the network of title words indicate that the field is nonetheless highly cohesive.

The characteristics of the publications in this journal, and the patterns of the bibliometric relations among them, may therefore indicate the type and extent of the cognitive and social integration of the various disciplinary backgrounds into scientometrics as a field.

The question, in other words, is how "hard" scientometrics is, and how strongly its knowledge is codified. These properties can be measured in terms of:
a) the relative age of the cited literature, the so-called "Price Index",
b) the relations among the authors of articles published in Scientometrics; and
c) the pattern of words in the titles of these articles.

According to Price's theory of knowledge growth (Price 1965), science distinguishes itself from other fields of study by the way scientists refer to their literature (Price 1970). The existence of "research fronts" in science supposedly leads to an "immediacy effect", which can be measured in terms of the so-called "Price Index".

The Price Index is defined as "the proportion of the references that are to the last five years of literature" (Price 1970). Price estimated that this index would vary between 22 and 39 percent if no immediacy effect were present. [1] A field that was all research front and with no general archive might have a Price Index of 75 to 80 percent.
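The definition above reduces to a simple proportion. The sketch below assumes that "the last five years" means references whose age (citing year minus cited year) is between 0 and 5 inclusive; the boundary handling and the function name `price_index` are assumptions, and the example years are invented.

```python
def price_index(citing_year, cited_years, window=5):
    """Percentage of references no older than `window` years.

    Assumption: a reference counts as recent when
    0 <= citing_year - cited_year <= window.
    """
    if not cited_years:
        raise ValueError("paper has no references")
    recent = sum(1 for year in cited_years if 0 <= citing_year - year <= window)
    return 100.0 * recent / len(cited_years)


# Hypothetical reference list of a paper published in 1990.
cited = [1990, 1989, 1988, 1986, 1985, 1980, 1975, 1960]
print(price_index(1990, cited))  # 5 of 8 references are recent
```

A value like this would sit in the 60 to 70 percent band that Price associated with hard science, whereas the 43.0 percent reported for Scientometrics falls in the social science range.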

From his analysis of 162 journals, Price (1970) concluded: "Perhaps the most important finding I have to offer is that the hierarchy of Price's Index seems to correspond very well with what we intuit as hard science, soft science, and nonscience as we descend the scale." Biochemistry and physics are at the top, with indexes of 60 to 70 percent, the social sciences cluster around 42 percent, and the humanities fall in the range of 10 to 30 percent.

Science is, on the whole, practised in tightly knit communities in which the authors address one another (Crane 1972).

Co-authorship relations can be considered as indicators of co-operation. [3]

The meaning of citation relations is less clear, given the ongoing citation debate (MacRoberts and MacRoberts 1989; Cozzens 1989; Luukkonen 1990; Leydesdorff and Amsterdamska 1990; Woolgar 1991). But whatever the precise meanings of citations may be, citations can be considered as sociometric data, and the resulting network can accordingly be analyzed (cf. Shrum and Mullins 1988).

We analyzed the extent to which the authors are connected to one another, i.e. the cohesiveness of the network, as well as the pattern displayed by each author in relation to all other authors, i.e. the position of authors in the network.

We also analyzed the similarities among authors in both these dimensions of the matrices, i.e. we clustered strongly connected authors as well as authors in similar positions. Direct as well as indirect linkages between the authors are involved in this analysis.

Strong cliques are sets of authors connected by relations in such a way that all members of the clique are connected to one another, and anyone for whom this holds is included in the clique. The inclusion criterion is less strong for weak cliques, in which all pairs within the clique must have relationships with all other pairs, and anyone with a relation to or from a member of the clique is included.

Strong structural equivalence clusters are sets of authors with completely identical positions in the network (the distance dP between them is zero). Weak structural clusters are sets of authors with a significant similarity in their patterns of relations (the distance dP is small).
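A minimal sketch of the idea of structural equivalence follows. The excerpt does not define the distance dP, so the code assumes a Euclidean distance between two authors' patterns of outgoing and incoming relations to all third parties; the matrix and the function name are hypothetical.

```python
import math


def structural_distance(adj, i, j):
    """Euclidean distance between the relation patterns of nodes i and j.

    adj[a][b] is 1 when author a cites author b. Ties between i and j
    themselves are skipped so that only relations to third parties count.
    """
    total = 0.0
    for k in range(len(adj)):
        if k in (i, j):
            continue
        total += (adj[i][k] - adj[j][k]) ** 2  # outgoing relations
        total += (adj[k][i] - adj[k][j]) ** 2  # incoming relations
    return math.sqrt(total)


# Hypothetical citation matrix: authors 0 and 1 both cite authors 2 and 3.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(structural_distance(adj, 0, 1))  # identical positions
print(structural_distance(adj, 0, 2))  # different positions
```

Under this assumed measure, authors 0 and 1 would form a strong structural equivalence cluster (distance zero), while a weak cluster would admit pairs whose distance is small but not zero.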

As noted, we wished to know whether a structurally codified semantics of scientometrics exists or whether, on the contrary, the articles in Scientometrics use the different terminologies of the various disciplines surrounding scientometrics.

Given the functions of titles of articles, the words in these titles can be considered as indicators of the cognitive message of the publication (Rip and Courtial 1984; Leydesdorff 1989a). The co-occurrence of words in titles can be considered as an indication of the existence or non-existence of relations between these words (Callon et al. 1983).
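Counting title-word co-occurrences can be sketched as follows. The tokenization (lowercasing, whitespace splitting, a small stopword list) is an assumption made for illustration, not the procedure used in the paper, and the titles are invented.

```python
from collections import Counter
from itertools import combinations


def title_cooccurrences(titles, stopwords=frozenset()):
    """Count how often each pair of distinct words appears in the same title."""
    counts = Counter()
    for title in titles:
        words = sorted({w for w in title.lower().split() if w not in stopwords})
        counts.update(combinations(words, 2))  # each unordered pair once per title
    return counts


# Hypothetical titles.
titles = [
    "citation analysis of journals",
    "co-citation analysis",
    "journals and citation counts",
]
pairs = title_cooccurrences(titles, stopwords=frozenset({"of", "and"}))
print(pairs[("citation", "journals")])
```

The resulting pair counts are the edge weights of a co-word network, which can then be examined for cohesiveness in the same way as the author networks.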

Since 1978, 779 items have been published in Scientometrics. They contain 12,341 references to the scientific literature. [10]

The number of publications per year in the journal increases in a linear way (Fig. 1). [11] After a steep growth during the first three years, the number increases by 3.5 publications per year.

The number of references per year shows a comparable pattern, although somewhat more irregular. Every publication contains on average 15.8 references (cf. Yitzhala, 1991). Since 1986, this number has become stable at an average of 15 references per publication (Fig. 2).

The publications in Scientometrics were written by 669 different authors. On average, every author published 1.8 times and every paper was written by 1.6 authors. Nearly three-fourths of the authors (488 or 73 percent) published only once in Scientometrics. The distribution of productivity among the authors is a Lotka distribution (Fig. 4).

The average Price Index of Scientometrics is 43.0 percent. ... The Price Index varies between 34.0 and 51.4 percent (Fig. 5). The regression line is not significant. [12] Apparently, the index displays neither rise nor fall since 1978.

Recently, Schubert and Maczelka (1993) concluded from an analysis of Scientometrics in 1980-81 and 1990-91 that the journal has moved slightly from the "soft" (social) towards the "harder" (natural) sciences. They drew this conclusion from the rise of the Price Index from 35 percent to 42 percent between these measurement points. This observation is, however, based on only two measurements. Because of the statistical fluctuations in the value of the Price Index over time, no conclusion can be drawn regarding the development of the Price Index if one restricts oneself to only two measurement points.

In accordance with Price's theory, the number of references to literature of a specific age rises until the cited literature is two years older than the citing literature, and then falls off (Fig. 7). Note that this decline is gradual. Apparently, only a small "immediacy effect" is visible in scientometrics.

A general phenomenon in science is the growth of the number of co-authored scientific articles, relative to the total scientific production (Luukkonen et al. 1992; Abt 1992).

In Scientometrics, however, 61 percent of the articles have been written by a single author. This share is stable over time.

The network of co-authorships is highly fragmented. ... With the exception of three subgroups, most co-authors cooperate with no more than one or two colleagues.

Comparison of the composition of the weak structural equivalence clusters with the relational cliques reveals that two clusters are identical: a group of authors from Leiden (Van Raan et al.) and a group of authors with various institutional affiliations, probably best characterized as the "co-word analysis group". So, these two groups have distinct identities, with respect both to their relations and to their positions in the network.

Some clusters seem constituted by the institutional affiliations of the authors. This holds for the Leiden group and for the authors around ISI (cluster 3). In other cases, nationality appears to be the binding force. This holds for the group in Hungary (cluster 5), the Belgian informetricians (cluster 10) and the Spanish scientometricians (cluster 6). However, cluster 1 can best be characterized by its research program (co-word analysis). Cluster 2 seems to consist of authors from Sussex together with CHI Research Inc. Thus, co-author relations are not only institutionally defined; shared interests and common intellectual goals play a role as well.

To sum up, scientometrics is a fragmentary field of co-authorships. The authors are highly selective in their co-authorship relations with one another. Co-authorships are defined neither exclusively by social nor only by intellectual factors. Both dimensions shape the pattern of co-authorships.

With respect to the number of solitary authors and the large number of isolated small clusters, scientometrics exhibits the pattern of a social science.

Of the 779 articles published in Scientometrics, 411 were subsequently cited one or more times in Scientometrics. The share of references to Scientometrics (as a percentage of all references) has stabilized around an average of 19.4 percent since 1987.

Of the 181 authors in the core set, 130 authors cite one another. So, 51 (or 28.2 percent) of the authors publishing more than one article in Scientometrics from 1978 till 1993 are neither citing nor cited within this group of authors.

The core set of authors in Scientometrics is found to be highly cohesive in terms of their mutual citation relations. All these authors are members of one single weak clique. Moreover, a majority of these authors (88) also belongs to one strong clique (Table 5).

The picture is different if we exclude all indirect relations from the analysis. This "fine structure" of the citation matrix is shown in Table 6, where 13 strong cliques and 6 weak cliques are revealed. Most strong cliques seem to coincide with shared institutional affiliations. The exception is clique 9, which indicates the existence of a debate among the members of this clique.

The most striking feature of the network of title words of articles published in Scientometrics is its cohesiveness. All words cluster together in a single strong component clique (Table 8). If only direct relations are included, all words cluster together in a single weak component clique. This means that all words are either used together in a title or share a common co-word.

Thus, the language of scientometrics is both strongly unified and weakly codified. This strong cohesiveness is a stable characteristic of the titles in Scientometrics, from the very start of the journal. Perhaps a distinct discourse already existed before the journal was founded. In any case, it constitutes a textual identity of scientometrics as a field, one probably different from the various mother disciplines. Thus a process of de-differentiation seems to have occurred not only in the patterns of citing (and being cited) but also at the cognitive level.

The interpretation of the Price Index is complicated because of these variations within disciplines. If we, nevertheless, take the Price Index preliminarily as an indicator of "hardness", scientometrics belongs to the group of relatively hard social sciences. At the same time, it stays unequivocally within the social science range. Taken literally, Price's dream has therefore not come true, since he postulated the emergence of a completely new type of social science with a natural science character. But if we reformulate his goal a posteriori in a more modest way, as the building of a relatively hard social science, it did come true.

The value of the Price Index appears stable over the years. Since a number of other indicators also exhibit stability, this seems to suggest the existence of some scientometric identity. For example, the journal expands at a regular rate, while the percentage of co-authored papers increases only very slowly. The origin of this stability can best be explained by the finding that the community of researchers who have published more than once in Scientometrics acts as a tightly knit network.

In addition to the co-authorships within various institutes, and partly overlapping with these structures, there are national co-authorship relations, like those among the Belgian informetricians, and programmatic co-authorship relations, like those among the users of the French co-word instrument. In general, co-authorship relations are firmly embedded in existing social structures, both at the national and at the community level.

These various strong graphs of co-authors, however, are structurally embedded in the communication structure as indicated by textual indicators. Both in terms of citation relations and in terms of title-words the network is very cohesive, while the structural dimensions of codification are less clear.

In summary, the community of authors publishing in Scientometrics is well integrated, while there are no indications of an exclusive paradigm or a research front.

Sunday, February 9, 2014

Schoepflin, U., & Glänzel, W. (2001). Two decades of "Scientometrics": An interdisciplinary field represented by its leading journal. Scientometrics, 50(2), 301-312.

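This study analyzes the characteristics of scientometrics using the references of articles in the journal Scientometrics. In earlier work, Schubert and Maczelka (1993) used the age of references in Scientometrics articles from two periods, 1980-1981 and 1990-1991, to compute the Price Index for each period; they found the later period's Price Index to be higher than the earlier one and concluded that scientometrics was moving toward the harder social sciences. Wouters and Leydesdorff (1994) argued that sampling only a few periods may be inaccurate and that the continuous change in the data should be observed; their study showed that the Price Index changed little between 1978 and 1992, so the field is a fairly stable social science with no trend toward hard science. Glänzel and Schoepflin (1994) instead held that scientometrics is a highly heterogeneous research field: its authors come from different disciplines and have quite different communication, citation, and publication practices, which drives this interdisciplinary field toward diversification.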

Using the references of Scientometrics articles from three sample years, 1980, 1989, and 1997, this study computes four statistics: 1) the Price Index per paper, 2) the percentage of references to serials, 3) the mean reference age, and 4) the mean reference rate. The number of papers and the indicators for each year are shown in the following table:


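The journal-level Price Index is quite stable across the three years, and all indicators show that Scientometrics follows a pattern resembling that of social science journals. The table also gives, for each year, the medians of the per-paper Price Index and of the share of serials; these are clearly larger than the journal-level values, indicating that many papers are harder than the journal as a whole suggests.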

The study also assigns each paper to one category. The categories are: 1) bibliometric theory, mathematical models and formalisation of bibliometric laws; 2) case studies and empirical papers; 3) methodological papers including applications; 4) indicators engineering and data presentation; 5) sociological approach to bibliometrics, sociology of science; 6) science policy, science management and general or technical discussions. The distribution of papers across categories and years is shown in the following table.


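It is clear that case studies and methodological papers made up a small share in 1980 but accounted for most papers in the two later years; conversely, science policy papers declined after the first year, as did papers with a sociological approach. The study suggests that journals such as Research Evaluation and Social Studies of Science, by attracting the relevant submissions, may explain the decline in science policy and sociological papers.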

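The following table analyzes the reference indicators of papers in each category. The figures show that papers in different categories differ greatly on these indicators, science policy in particular; so when science policy papers dropped sharply in the second and third sample years, the journal's overall indicators changed considerably. The study therefore concludes that scientometrics is a highly heterogeneous field.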



The development of the field of bibliometric and scientometric research is analysed by quantitative methods to answer the following questions:
(1) Is bibliometrics evolving from a soft science field towards rather hard (social) sciences (Schubert-Maczelka hypothesis)?

(2) Can bibliometrics be characterised as a social science field with stable characteristics (Wouters-Leydesdorff hypothesis)?

(3) Is bibliometrics a heterogeneous field, the sub-disciplines of which have their own characteristics? Are these sub-disciplines more and more consolidating, and are predominant sub-disciplines impressing their own characteristics upon the whole field (Glänzel-Schoepflin hypothesis)?

The findings suggest that the field is in fact heterogeneous, and each sub-discipline has its own characteristics.

Indeed, this journal covers almost the complete spectrum of bibliometric research. It publishes theoretical papers and papers on mathematical models as well as on the research evaluation of special fields and/or selected institutions, on science policy questions as well as articles on social studies of science and general discussions about the field.

While Schubert and Maczelka (1993) found a clear move from ‘softer’ towards ‘harder’ (social) sciences between the analysed time periods 1980-1981 and 1990-1991, respectively, Wouters and Leydesdorff (1994) concluded on the basis of the change of Price’s Index in time that bibliometrics has not become a hard social science field in the observation period 1978-1992.

Glänzel and Schoepflin (1994) stated in their discussion paper that bibliometrics has become a heterogeneous field and sub-disciplines are drifting apart. Consequently, bibliometrics comprises sub-disciplines with distinctly different communication, citation and publication characteristics.

Assuming that bibliometrics is an interdisciplinary field and that authors coming from different fields bring their specific communication behaviour into it, we have classified all papers published in Scientometrics into different categories representing the main field-specific approaches to bibliometrics.

All source articles published in the journal Scientometrics in three sample years, 1980, 1989 and 1997, have been processed. All references cited in articles, notes and letters in the above three publication years were selected. Review articles have not been taken into consideration since the extent and structure of the reference lists of these documents are expected to distinctly differ from those of other research papers. Papers without references have been omitted.

References have been assigned to two categories, reference to serials (S) and reference to non-serials (N). All references have been classified manually.

The following statistics have been calculated.
1. The Price Index per paper. This index is defined as the percentage of references not older than five years in all references of an individual paper. This indicator has been introduced by Moed (1989).
2. The percentage of references to serials. The share of references assigned to category S in all references (N+S) cited by a journal or subfield expressed in percent.
3. The mean reference age. The ages of the references cited in a journal or subfield are summed up and divided by the number of references. This indicator can also be determined as a conditional mean, that is, for the subsets of references to serials and to non-serials separately.
4. The mean reference rate. This is the ratio of the number of references cited by a journal and the total number of papers published in the journal including those without references.
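The four statistics above can be sketched on a toy data set. The record layout (a `year` and a list of `(cited_year, is_serial)` pairs per paper) and the function name are assumptions made for illustration; the per-paper Price Index is summarized by its median, as in the comparison the paper later makes with the journal-level values.

```python
from statistics import mean, median


def reference_indicators(papers, window=5):
    """Compute the four reference statistics for a set of papers."""
    # (age, is_serial) for every reference in the set
    refs = [(p["year"] - ry, serial) for p in papers for ry, serial in p["refs"]]
    per_paper_pi = [
        100.0 * sum(1 for ry, _ in p["refs"] if p["year"] - ry <= window) / len(p["refs"])
        for p in papers
        if p["refs"]  # papers without references have no Price Index
    ]
    return {
        "median_price_index": median(per_paper_pi),
        "pct_serials": 100.0 * sum(1 for _, s in refs if s) / len(refs),
        "mean_reference_age": mean(age for age, _ in refs),
        # denominator includes papers without references
        "mean_reference_rate": len(refs) / len(papers),
    }


# Hypothetical papers from one sample year.
papers = [
    {"year": 1997, "refs": [(1996, True), (1994, True), (1980, False), (1970, False)]},
    {"year": 1997, "refs": [(1995, True), (1993, True)]},
    {"year": 1997, "refs": []},
]
result = reference_indicators(papers)
print(result)
```

Here the pooled share of recent references is 4 of 6 (about 67 percent) while the median per-paper Price Index is 75 percent, a small-scale version of the gap between paper-based medians and journal-level indicators.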

The “Price Index” is commonly used as a measure to distinguish between hard science, soft science, technology and non-science (see Price, 1970).

According to the results of an earlier study (Glänzel and Schoepflin, 1999), the percentage of references to serials proved to be a sensitive measure to characterise typical differences in the communication behaviour between the sciences and the social sciences.

The mean reference age also serves as an efficient measure of the “hardness” of science. In the paper by Glänzel and Schoepflin (1999), a comparison of the mean age of references and the Price Index has shown that the age of references is only in part reflected by the Price Index, in particular if the average age of references does not exceed about 15 years.

In addition, we calculated the mean reference rate, that is, the “average size” of the reference list of a bibliometric paper published in Scientometrics. Although this cannot be considered a sensitive measure of the “hardness” of science, it reflects nevertheless additional field-specific characteristics (see Glänzel and Schoepflin, 1999).

In the next step, all selected source articles in Scientometrics have been assigned manually to one category out of a scheme of six. ... The classification scheme used for this study is presented in Table 1. The classification permits to group the material in several ways: the categories can be regarded from the viewpoint of core bibliometrics (2, 3, and 4) and background research (1, 5, and 6), but also with respect to theoretical (1, 3, and 5) and applied research (2, 4, and 6).


Table 2 presents the number of papers assigned to each category in 1980, 1989 and 1997.




However, the Price Index shows stability also in our case. Even more, all indicator values reflect patterns typical of social-science journals (c.f. Glänzel and Schoepflin, 1999).

It is worth mentioning here, that the median of paper-based indicators is greater both for the Price Index and the share of serials than the corresponding journal indicators. This phenomenon allows only one possible interpretation: Numerous papers are ‘harder’ than expected on the basis of the overall journal indicators.

There are obviously two dramatic developments: first, there is an impressive and steady growth of Case Studies, from a fourth position in 1980 to the predominant first position in 1997. Second, there is a similarly impressive loss of share of articles on Science Policy and Discussions (category 6) from the predominant first position in 1980 to a minor category in 1997. This goes along with a steady loss in material with a sociological approach, too (category 5). On the other hand, there is a certain increase in Methodology (category 3), while Theory and Indicator Engineering remain minor classes.



If we now take a look at Core Bibliometrics as defined by categories 2, 3, and 4, it becomes obvious that this group is practically reduced to Case Studies and Methodology, and by far dominates the total output of research as represented by the journal Scientometrics.

The group characterised as Background Research (categories 1, 5, and 6) has continuously lost ground since 1980. Moreover, theoretical research in bibliometrics seems to be mainly a matter of methodology.

Following the differentiation of the field, journals like Social Studies of Science or the newly founded Research Evaluation publish a considerable share of bibliometric research in categories 5 and 6. But also journals in information science (e.g. JASIS, Information Processing & Management, or Journal of Information Science to name just a few Anglo-Saxon titles) attract bibliometric research articles.

To the detriment of a larger scope, Scientometrics has clearly become the forum for Case Studies and Methodology-oriented contributions.

The above-mentioned deviating patterns in 1980 can be at least in part explained by the great share of papers in category 6 and their low share of serials and relatively low age of the references. Indeed, the indicator values of the six categories give evidence of specific characteristics of the corresponding sub-disciplines.



The above trends and figures tell unambiguously against an evolution of bibliometrics towards a discipline of ‘hard’ social science (Schubert-Maczelka hypothesis). On the other hand, we cannot speak of stable characteristics, either (Wouters-Leydesdorff hypothesis). The indicators allow only an interpretation in the sense of the third hypothesis. The field is indeed heterogeneous, and each sub-discipline has its own characteristics. This may, of course, be at least in part caused by the deviating field-specific communication behaviour of the authors who bring traditional organisation schemes from their own fields into bibliometrics.

Saturday, February 8, 2014

Chen, C., McCain, K., White, H., & Lin, X. (2002). Mapping Scientometrics (1981–2001). Proceedings of the American Society for Information Science and Technology, 39(1), 25-34.

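Science mapping is research that integrates information visualization and scientometrics, using graphical representations to reveal the structure of scientific literature and the underlying specialties. Its most fundamental techniques are co-word analysis and co-citation analysis, each of which offers a unique perspective on the structure of scientific frontiers. Braam, Moed, & Raan (1991a, 1991b) found that combining the two techniques yields a clearer picture of the cognitive content of publications.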

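Scientometrics is the study of the measurement of scientific and technological progress (Garfield, 1979b). Traditionally it has been strongly application-oriented, developing many measures and indicators of the inputs and outputs of science and technology. Knowledge workers apply these measures and indicators as tools in a variety of studies: for example, evaluating policies and programs with respect to the research strength of countries, regions, and research institutions, or conducting domain analysis of the intellectual structure of a research field. van Raan (1997) and Persson (2000) are both studies that use Scientometrics articles as their data. van Raan (1997) analyzed the state of the art of scientometrics and described its application-oriented tradition; he suggested that scientometrics needs to integrate with knowledge discovery and data mining to gain significant benefits. Persson (2000) analyzed 1,062 articles from 44 volumes published between 1978 and 1999, identified the most cited publications, and produced maps of journal co-citation, direct citation links among countries, shared citations among authors, and direct citations among authors, showing a variety of structures.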

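This study uses Scientometrics articles published between 1981 and 2001 as data and selects the documents cited five or more times, 403 in total. A network based on the co-citation information of these documents serves as the base map of the science map, and the citation rates of the articles are used to produce an animation. The study first computes Pearson's correlation coefficients from the co-citation counts to form a co-citation matrix. Principal component analysis is then applied to the co-citation matrix as a factor analysis, with the extracted factors representing the specialties of the field. The co-citation matrix is also used to generate a network, which is simplified by Pathfinder network scaling so that only the more important co-citation links are kept. Finally, the network is rendered in VRML (Virtual Reality Modeling Language), and the growth in citation rates is shown as an animation. The study identifies 25 factors; the specialties corresponding to the three largest are named citations in science studies, world and national science performance, and evaluation research outputs.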

The design of the visualization model adapts a virtual landscape metaphor with document cocitation networks as the base map and annual citation rates as the thematic overlay. The growth of citation rates is presented through an animation sequence of the landscape model.

Science mapping aims to reveal structures of scientific literature and underlying specialties using graphical representations. ... Co-word analysis (Callon, Law, & Rip, 1986) and co-citation analysis (Small, 1973) are among the most fundamental techniques for science mapping. ... Each offers a unique perspective on the structure of scientific frontiers. Researchers have found that a combination of co-word and co-citation analysis could lead to a clearer picture of the cognitive content of publications (Braam, Moed, & Raan, 1991a, 1991b).

As an integral part of our long-term research, our investigation emphasizes an interdisciplinary synergy that may involve fields of study such as information visualization and scientometrics.

Can we provide domain analysts, science performance evaluators, researchers, students, and other knowledge workers something tangible and meaningful that they can readily incorporate into their work process? Can we improve the way we learn about a new subject matter, the way we explore a knowledge domain, and the way we trace the history and evolution of a specialty? And ultimately, can we augment our ability to judge the significance of scientific work more efficiently and more accurately?

The present study is based on articles published in Scientometrics between 1981 and 2001, drawn from the Web of Science.

Scientometrics is “the study of the measurement of scientific and technological progress” (Garfield, 1979b). Its origin is in the quantitative study of science policy research, or the science of science, which focuses on a wide variety of quantitative measurements, or indicators, of science at large.

Input indicators include the amount of research grants awarded to institutions and the number of people receiving scientific degrees; output indicators include the number of scientific articles published, the number of citations to each article, and the number of patents granted.

Science policy and program evaluation studies have used such indicators to measure the scientific strength of various countries, regions, or research institutions.

Domain analysts have used such indicators to describe the intellectual structure of a knowledge domain.

Scientometric research has a strong application-oriented tradition (Garfield, 1979b; Raan, 1997).

Garfield (1979b) identified several publications that appeared in the 1970s and contributed to the development of scientometrics, namely the first Science Indicators, published by the National Science Board in 1972 (Board, 1977), and Evaluative Bibliometrics: The Use of Publication and Citation Analysis in the Evaluation of Scientific Activity by Francis Narin and Computer Horizons, Inc. (CHI) in 1976 (Narin, 1976), which has been regarded as a good review source for anyone interested in scientometrics (Garfield, 1979b).

Derek Price’s 1965 article ‘Network of Scientific Papers’ (Price, 1965) has been also regarded as a key event in the development of the field of scientometrics.

Michael Moravcsik (1977) presented a review of scientometric literature.

Anthony van Raan (1997) analyzed the state of the art of scientometrics and characterized its application-oriented tradition. He envisaged that scientometrics could benefit significantly from a greater integration with knowledge discovery and data mining.

Loet Leydesdorff (2001) identified some challenges of scientometrics and suggested that: “the state of the art of science studies is ‘preparadigmatic:’ it is an interdisciplinary area integrated only at the level of its subject matter, and an applicational area for various contributing disciplines.”

A directly related study of Scientometrics was done by Olle Persson (2000). He retrieved 1,062 articles published in the journal from volume 1 to volume 44 between 1978 and 1999. Top-10 most cited publications include (Garfield, 1979a; Schubert, Glanzel, & Braun, 1989; Small, Sweeney, & Greenlee, 1985). He generated several maps to show a variety of structures, including journal co-citation, direct citation links among countries, shared citations among authors, and direct citations among authors.

This study is based on bibliographic data retrieved from the Web of Science. The data contain all types of documents published in Scientometrics between 1981 and 2001. ... Each article must be cited 5 times or more in order to be included in this integrated analysis. This threshold resulted in a total of 403 articles.

In this study we have adapted an integrated procedure of citation analysis and information visualization, including Pathfinder network scaling, Principal Component Analysis (PCA), and visual-spatial models rendered in Virtual Reality Modeling Language (VRML).

The cocitation strength is computed as Pearson’s correlation coefficients to form a co-citation matrix. ... The co-citation matrix forms the basis of a base map, a terminology commonly used in cartography.
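A minimal sketch of this step, assuming the input is a binary matrix whose rows are citing papers and whose columns are cited publications. The names `citations` and `cocitation_correlation` are illustrative and do not come from the study. Raw co-citation counts are derived first, and each pair of publications is then scored by the Pearson correlation of their co-citation profiles. The notes do not say how the diagonal of the count matrix was treated, so this sketch simply leaves it as each publication's own citation count.

```python
import numpy as np


def cocitation_correlation(citations):
    """Return a Pearson-correlation co-citation matrix.

    citations: 2-D array-like, rows = citing papers, columns = cited
    publications, entries 1 (cites) or 0 (does not cite). Assumed layout.
    """
    citations = np.asarray(citations, dtype=float)
    # counts[i, j] = number of papers that cite both publication i and j
    counts = citations.T @ citations
    # Correlate the co-citation profiles (rows of the count matrix)
    return np.corrcoef(counts)


# Toy data: 4 citing papers, 3 cited publications (made up for illustration)
toy = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
corr = cocitation_correlation(toy)
```

In the toy data, publications 0 and 1 are cited together twice while 0 and 2 are never cited together, so `corr[0, 1]` comes out higher than `corr[0, 2]`.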

Factor analysis, namely PCA, is subsequently applied to the co-citation matrix in order to produce a thematic overlay. The purpose of such a thematic overlay is to highlight the density distribution of various specialties. Factor loadings are used to color code each publication in the thematic overlay.
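One way to picture the factor-loading step is an eigendecomposition of the correlation matrix, with each publication assigned to the factor on which it loads most strongly. This is only a sketch under that assumption; the function name `factor_loadings`, the toy matrix, and the rule of picking the largest absolute loading are illustrative and not taken from the study.

```python
import numpy as np


def factor_loadings(corr, n_factors):
    """Loadings of each item on the top principal components of `corr`."""
    corr = np.asarray(corr, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(corr)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_factors]    # largest first
    top_vals = np.clip(eigvals[order], 0.0, None)    # guard tiny negatives
    # Loading = eigenvector scaled by the square root of its eigenvalue
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy correlation matrix with two obvious groups of publications
toy_corr = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
loadings = factor_loadings(toy_corr, n_factors=2)
# Eigenvector signs are arbitrary, so compare absolute loadings
dominant_factor = np.argmax(np.abs(loadings), axis=1)
```

The `dominant_factor` array could then drive the color coding: publications sharing a dominant factor would receive the same color in the overlay.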

We simplify the cocitation matrix using Pathfinder network scaling, which retains the strongest co-citation links with reference to the so-called triangle inequality condition (Chen, 1997, 1998; Schvaneveldt, 1990).
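The pruning idea can be sketched as follows. The notes do not give the Pathfinder parameters, so this example assumes the commonly used setting r = ∞ and q = n − 1, and it assumes the input is a symmetric distance matrix (for instance, one derived from the co-citation correlations). Under those assumptions a link survives only if no indirect path has a smaller maximum step than the direct link. The function name `pathfinder_edges` is hypothetical.

```python
import math


def pathfinder_edges(dist):
    """Return the links kept by Pathfinder scaling with r = inf, q = n - 1.

    dist: symmetric square list of lists of non-negative distances,
    with math.inf for missing links (assumed input format).
    """
    n = len(dist)
    # minimax[i][j] = smallest possible "largest step" on any path i -> j
    minimax = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(minimax[i][k], minimax[k][j])
                if via_k < minimax[i][j]:
                    minimax[i][j] = via_k
    # Keep a direct link only if no indirect path beats it
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.isfinite(dist[i][j]) and dist[i][j] <= minimax[i][j]
    ]


# Toy distances: the 0-2 link (3) is longer than the detour 0-1-2 (max step 2)
toy_dist = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
kept = pathfinder_edges(toy_dist)
```

In the toy example the direct link between nodes 0 and 2 violates the triangle inequality condition and is dropped, leaving the chain 0-1-2.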

Finally, the growth of citation rates is animated across the entire Pathfinder network to facilitate the identification of trends. The visualization-animation model is made available in VRML 2.0 for easy access on the Internet.

PCA identified 25 factors from the 403 by 403 co-citation matrix. In theory, each factor should correspond to a specialty. ... The large number of factors reflects the diversity of scientometrics.

In our analysis, we focus on the three largest factors of significant specialties of the field.
Specialty 1: Citations in Science Studies.
Specialty 2: World and national science performance.
Specialty 3: Evaluation research outputs.

Friday, February 7, 2014

Peritz, B. C., & Bar-Ilan, J. (2002). The sources used by bibliometrics-scientometrics as reflected in references. Scientometrics, 54(2), 269-284.

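Van Raan (1997) discussed the state of the art of scientometrics, stressing that the field needs to balance applied and basic research and to strengthen the ties between scientometrics and a broad range of disciplines. Many earlier studies have analyzed the characteristics of the field through the references of papers in the journal Scientometrics. For example, Schubert and Maczelka (1993) used the references of papers from two periods, 1980-81 and 1990-91, to analyze how the journal changed during the 1980s; their indicators included the age distribution of references, the Price Index, the distribution of cited publications, the distribution of cited authors, and the most frequently cited publications. Wouters and Leydesdorff (1994) analyzed the references of papers in the first 25 volumes, calculating the average number of references per paper and the relative age of the cited literature. Schoepflin and Glänzel (2001) used reference data from papers published in 1980, 1989, and 1997 to calculate the Price Index, the percentage of references to serials, the mean reference age, and the mean reference rate.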

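This study compares the references of papers published in Scientometrics in 1990 and in 2000, covering 169 papers (70 from 1990 and 89 from 2000; note: this figure is in error, since the two years add up to only 159 papers) and 2,814 references, and classifies the sources of the references by field. The results show an average of 15.1 (1054/70) references per paper in 1990 and 19.8 (1760/89) in 2000. The share of references to literature from the most recent five years, that is, the Price Index, fell from 37.6% in 1990 to 31.6% in 2000. In 1990, 47.3% of the references came from scientometrics and bibliometrics, library and information science, and the sociology, history and philosophy of science; in 2000 this share rose to 56.9%. Author self-citation did not differ noticeably between the two years: 13.4% in 1990 and 13.9% in 2000. Journal self-citation and the percentage of references to journals, however, both increased. Journal self-citation rose from 12.9% in 1990 to 20.1% in 2000, possibly because Scientometrics is increasingly becoming the core journal of the field.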

The aim of this study was to examine the extent to which the field of bibliometrics and scientometrics makes use of sources outside the field.

The results show that in 2000, 56.9% (and 47.3% in 1990) of the references originated from three fields: scientometrics and bibliometrics; library and information science; and the sociology, history and philosophy of science.

When comparing the two periods, there is also a considerable increase in journal self-citation (i.e., references to the journal Scientometrics) and in the percentage of references to journals.

Van Raan (1997) discussed the state-of-the-art of scientometrics, and emphasized the need to balance between applied and basic research in the field, and the importance of strengthening the relations of scientometrics with a broad spectrum of disciplines.

Peritz (1981) examined the references of research papers published in 39 core journals during five calendar years, and calculated the percentage of references outside the field. In another study, Peritz (1988) examined the literature for bibliometrics for the period 1960–1985, and classified the body of literature according to the field of the journal in which the article was published.

Al-Sabbagh (1987) studied the interdisciplinarity of information science through a reference analysis of JASIS. The findings, based on a ten percent random sample of references appearing in JASIS articles between 1970 and 1985 show that the largest percentage of references come from information science, followed by computer science, library science and science-general.

Thompson (1989), based on the references of articles in twenty library and information science journals in five selected years, studied the age of the references, the extent of self citation, and found that the list of most cited journals was almost exclusively from library and information science.

Cronin and Pearson (1990), in a study based on citations, found that information science exports techniques of information retrieval and bibliometrics.

Meyer and Spencer (1996) analyzed citations to twenty-four library and information science journals over a twenty-year period. Their findings show that 86.6% of the citations come from library and information science, but other disciplines including computer science, medicine, psychology, the social sciences and general sciences also cite library and information science journals to some extent.

Rousseau (1997) studied the references appearing in the papers of the first two ISSI Conferences and citations of the Proceedings. He tabulated the most frequently cited publications; the three most frequently cited publications were JASIS, Scientometrics and J. Doc. The lists of the most frequently citing journals and of the most cited papers from the Proceedings were also presented.

The journal Scientometrics has been the “theme” of several previous bibliometric studies.

Schubert and Maczelka (1993) studied the changes that occurred to the journal during the 1980’s based on its reference patterns. Two periods, 1980–81 and 1990–91 were chosen. They calculated the age distribution of the references, the Price Index, the distribution of the cited publication, the distribution of cited authors, and tabulated the most frequently cited publications.

Wouters and Leydesdorff (1994) analyzed the references of all articles and notes in the first 25 volumes of Scientometrics, and calculated, among other indicators, the number of references per article and the relative age of the cited literature (the Price Index).

Persson (2000) maps the citation and reference patterns of Scientometrics based on volumes 1 to 44 of the journal.

Very recently Schoepflin and Glänzel (2001) calculated several bibliometric measures (the Price Index, percentage of references to serials, mean reference age and mean reference rate) for articles, letters and notes published in Scientometrics in 1980, 1989 and 1997.

Most of the previous studies of Scientometrics were either concerned with quantitative aspects (e.g., the Price Index) or with citation and co-citation patterns.

This study analyzed all the references of all the papers published in Scientometrics in 1990 and in 2000. The population of the study consisted of 169 papers and 2814 references. ... The list contained 70 items for 1990 and 89 items for 2000, altogether 169 papers. ... In the set for 1990, 60 (86% of the total for 1990) items in the list were labeled as articles, while for 2000, 83 (93% of the total for 2000) items were labeled as articles.

The references were categorized according to six facets:
• author self-citation;
• journal self-citation;
• discipline of publication source;
• field self-citation;
• type of publication;
• year of publication.

Thus, first we had to define the major themes covered by the field:
• indicators (science and technology), forecasting and planning;
• research trends, research evaluation and funding;
• science policy;
• bibliometric laws and models;
• citation analysis including all aspects (e.g. obsolescence, ranking, mappings, coupling, etc.);
• patent analysis;
• reference analysis;
• coword analysis in context of performance;
• productivity (e.g. authors, journals, institutions);
• impact;
• peer review process;
• sociology of science;
• social contexts of research;
• characteristics and development of a scientific area;
• scholarly communication;
• scientific networks;
• technology flow;
• innovation;
• other themes relevant to the field.

Altogether, 2814 references were identified, 1054 in 1990 and 1760 in 2000.
The mean number of references in 1990 was 15.1, while in 2000 the mean increased to 19.8.
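The quoted means can be reproduced directly from the counts given in these notes (1054 and 1760 references; 70 and 89 papers). The short check below uses only those figures; the variable names are illustrative.

```python
# Counts quoted in the notes for each publication year
references = {1990: 1054, 2000: 1760}
papers = {1990: 70, 2000: 89}

# Mean number of references per paper, rounded to one decimal place
mean_refs = {year: round(references[year] / papers[year], 1) for year in references}

total_references = sum(references.values())
total_papers = sum(papers.values())
```

The reference total matches the reported 2814, while the two yearly paper counts add up to 159.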

It is interesting to note that the percentage of references in the last five years (1986 to 1990, for 1990; and 1996 to 2000 for 2000) decreased from 37.6% to 31.6%. Does this mean that scientometrics is getting “softer” (the “Price Index”, (Price, 1970))?
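As a small illustration of the measure, the function below computes the percentage of references that fall within a five-year window ending in the citing year, which matches the windows described here (1986 to 1990 for 1990). The function name `price_index` and the sample years are made up for the example.

```python
def price_index(reference_years, citing_year, window=5):
    """Percentage of references published in the last `window` years,
    counting the citing year itself (e.g. 1986-1990 for citing_year=1990)."""
    if not reference_years:
        raise ValueError("reference_years must not be empty")
    first_recent_year = citing_year - window + 1
    recent = sum(1 for year in reference_years
                 if first_recent_year <= year <= citing_year)
    return 100.0 * recent / len(reference_years)


# Hypothetical reference list of a single 1990 paper
sample_years = [1990, 1989, 1988, 1986, 1985, 1970, 1960, 1950]
sample_index = price_index(sample_years, citing_year=1990)
```

Four of the eight sample references fall in 1986 to 1990, so the index for this toy paper is 50%.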

Both Schubert and Maczelka (1993) and Schoepflin and Glänzel (2001) found that the Price Index of scientometrics increased over time. They, as in the current work, based their data on single years.

On the other hand, Wouters and Leydesdorff (1994) studied the first twenty-five volumes of Scientometrics, observed some fluctuations in the Price Index over the years, but showed that the regression line is not significant, and concluded that the index displays neither rise nor fall between 1978 and 1992.

We also believe that in order to draw conclusions about the “hardness” or “softness” of the field, its journal or journals should be studied over a continuous time period, and not isolated years.

A reference for a given item was labeled as author self-citation, if one of the authors of the reference matched one of the authors of the given item. ... Author self-citation was 13.4% for 1990 (141 references) and 13.9% (244 references) for 2000. ... In terms of author self-citation, no significant differences were observed between the two periods.
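The labeling rule can be sketched as a set intersection over author names. The name normalization used here (trimming and lowercasing) is an assumption, as the notes do not say how names were matched, and the author names in the example are placeholders. The percentages are recomputed from the counts quoted above.

```python
def is_author_self_citation(citing_authors, cited_authors):
    """True if at least one author appears on both the citing item and
    the reference. Names are compared after trimming and lowercasing."""
    def normalize(names):
        return {name.strip().lower() for name in names}
    return bool(normalize(citing_authors) & normalize(cited_authors))


# Placeholder names, not taken from the study
shared = is_author_self_citation(["Author A", "Author B"], ["author b ", "Author C"])
disjoint = is_author_self_citation(["Author A"], ["Author C"])

# Shares recomputed from the counts quoted in the notes
share_1990 = round(100 * 141 / 1054, 1)
share_2000 = round(100 * 244 / 1760, 1)
```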

Journal self-citation, (i.e., references to the journal Scientometrics), on the other hand, increased considerably, from 12.9% in 1990 (136 journal self-citations) to 20.1% (354 journal self-citations) in 2000.

A possible explanation for this increase is that the journal Scientometrics is more and more becoming the central journal of the field.

The top four journals appear exactly in the same order for both years, and they represent the main aspects of the field: the field itself, its relation to information and library science, to planning and management and to the sociology of science. These four sources cover 18.5% of the references in 1990, and 28.1% of the references in 2000.

In the list of most frequently cited publications we see mostly journals, but also books, handbooks, yearbooks, collections, proceedings and reports. ... Table 4 shows that the percentage of the references to journal articles increased considerably, while the percentage of the references to books, yearbooks and reports decreased. In 1990 there were no references to electronic sources, this category only appeared in 2000. It will be interesting to see whether the electronic sources are going to be referenced more extensively in the future.

Along with the increasing citation rate of the journal Scientometrics, we observe a general increase in sources belonging to the field of scientometrics and bibliometrics. This could either be the sign that the field is becoming more mature or self-sufficient or it may indicate that scientometricians base their research less and less on methods and studies conducted in other fields.

Most of the references in 2000 (56.9%), and nearly half of the references in 1990 (47.3%), are from the three fields closely related to the subject matter: scientometrics and bibliometrics itself; library and information science; and sociology and history of science.

A substantial number of references are to sources belonging to the social sciences (21.3% in 1990, and 13.0% in 2000). We see that the percentage of references to sources from the social sciences decreased considerably.

On the other hand, the combined share of sciences-general; mathematics, computer science, statistics and engineering; science and medical sciences remained nearly the same (23.4% in 1990 versus 23.5% in 2000).

About half of the references originate from sources that are not related to scientometrics. It is quite possible that some of these references are to works that belong to the field.

On the other hand, some of the references from the fields closely related to scientometrics are not classified as field self-citation.

In 1990, 593 out of the 1054 references (56.3%) were classified as field self-citation, while in 2000, 1092 out of the 1760 references (62.0%) were field self-citations. This is a rather considerable increase; it may indicate that the field is becoming more and more self-sufficient, and needs to rely less on theories and methods emanating from other scientific fields.
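The self-citation shares follow from simple ratios of the counts quoted in these notes. The check below recomputes the journal and field self-citation percentages; the helper name `share` is illustrative.

```python
def share(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100 * part / total, 1)


total_refs = {1990: 1054, 2000: 1760}

# Counts quoted in the notes
journal_self = {1990: 136, 2000: 354}
field_self = {1990: 593, 2000: 1092}

journal_share = {year: share(journal_self[year], total_refs[year]) for year in total_refs}
field_share = {year: share(field_self[year], total_refs[year]) for year in total_refs}
```

Both shares rise between the two years: journal self-citation by about 7 percentage points and field self-citation by about 6.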

The results show that the field relies heavily on itself, on library and information science and on sociology, history and philosophy of science.

There is an increase in journal self-citation, the list of core journals remaining stable for both periods. Author self-citation is around 14% for the years under study (13.4% in 1990 and 13.9% in 2000).