Wednesday, December 23, 2015

Gan, Q., Zhu, M., Li, M., Liang, T., Cao, Y., & Zhou, B. (2014). Document visualization: an overview of current research. Wiley Interdisciplinary Reviews: Computational Statistics, 6(1), 19-36.

Document visualization is an information visualization technique that transforms textual information such as words, sentences, documents, and the relationships among them into visual forms, so that users facing large numbers of documents can understand them better and with less mental workload. Documents are usually minimally structured but rich in attributes and metadata; compared with text visualization, document visualization therefore focuses mainly on documents together with the attributes and metadata they contain. Possible applications include: (1) word frequency and distribution; (2) semantic content and repetition; (3) the topics that distinguish document clusters; (4) the core content of a document; (5) similarity among documents; (6) connections among documents; (7) how document content changes over time; and (8) information diffusion and other patterns in social media, as well as ways to improve text search.

This study analyzes the collected document visualization techniques in terms of visualization objects and tasks. The visualization objects are divided into single documents, document collections, streaming text messages, and retrieval results; the corresponding visualization tasks are described below:

Single document visualization aims at quickly understanding and absorbing the core content and textual features of a document, focusing on words, phrases, semantic relations, and content. It falls into three types:
1. Vocabulary-based visualization, which presents lexical features such as word frequency, word distribution, and lexical structure:
Representative techniques are Tag Clouds [6,7] and Wordle [8,9], which use position, color, and size to present word frequencies within a single document. Many recent studies build on and improve this approach, such as parallel tag clouds (PTC) [10], ManiWordle [11], the context preserving dynamic word cloud [12], and the visualization of internet discussions with extruded word clouds [13]. Other methods in this category, though based on different principles, include TextArc [14] and DocuBurst [16].
2. Visualization of semantic structure, which presents entities and the relationships among them:
Semantic Graphs [19] uses parse trees produced with the Penn Treebank to extract subject-verb-object triplets from each sentence, resolves pronominal anaphors, and links the entities together to produce a semantic graph.
3. Visualization based on the characteristics and relationships of document content:
For example, WordTree [23] represents the context of a word with a tree structure: the root of the tree is the word selected by the user, each branch represents a context in which the word appears in the document, and node size represents the frequency of each word. Arc Diagrams [25] connect repeated subsequences with semicircular arcs to show complex patterns of repetition in the content.

Document collection visualization aims to reveal the topics of document clusters, the similarities and differences among documents, and how content changes over time. The related techniques can be divided into:
1. Visualization of document themes: the goal is to discover specific topics and to reflect the relationships among different topics. Well-known examples include ThemeScapes [26], INSPIRE's ThemeView, and The Galaxy [27], which use a terrain map and a scatter plot, respectively, to show how documents are distributed over themes; the height of each "mountain" on the terrain map indicates the strength of a theme. TopicNets uses the nodes and links of a network graph to present topical relationships among documents. ThemeRiver [29] and Topic Island [30] focus on how the themes in a document collection change over time; in ThemeRiver [29], the X-axis represents time, differently colored "rivers" on the Y-axis represent themes, and the width of a river shows the strength of a theme in the associated documents.
2. Visualization of document core content: the goal is to provide an overview of an entire document collection. For example, Document Cards [35] is used to present large document collections; each card contains the important terms and images of a document, where the terms are extracted from the document text with text mining techniques and the images are extracted from the document and then composed.
3. Visualization of changes over versions: presents the differences among versions. For example, History Flow [36] is designed to show how the content of Wikipedia articles changes across versions and who the corresponding authors are; there are also many visualizations of software code development.
4. Visualization of document relationships: discovers connections among entities in different documents, where entities include people, places, dates, organizations, and so on. Jigsaw [46] is a visualization technique that provides this function. Other techniques include ContexTour [47] and PivotPaths [49], which are applied to paper collections, and FacetAtlas [48], which links entities such as causes, symptoms, treatments, and diagnoses in Google Health documents.
5. Visualization of document similarity: the goal is to place similar documents close to each other and dissimilar documents far apart. A common approach is the self-organizing map (SOM) [50], which projects high-dimensional data onto a 2D plane so that complex, nonlinear relationships among data items can be expressed as distances; well-known examples are Lin's work [52] and WEBSOM [53].

As for applications, with the recent popularity of social media, research on real-time processing of streaming text has flourished. Research problems include the statistical analysis and presentation of topics and terms, the highlighting of topic-related events, and the visualization of the text messages themselves [54]. Christian Rohrdantz [55] reviews research on real-time visualization of streaming text data, and Whisper [57] can trace the process of information diffusion in social media. Another application is visual interfaces for retrieval results: early work includes TileBars [58], Sparkler [59] presents the results of multiple queries visually at the same time, and RankSpiral [60] focuses on comparing the retrieval results of multiple queries or different search engines.

The access methods provided by the various techniques, their requirements for documents, and their main features are summarized in the table below.



The main features are explained as follows:
1. Extension: the method can be applied to large document collections.
2. Versatility: the method is suitable for multiple visualization tasks.
3. Interactivity: the method provides a more intuitive human-machine interface and lets users participate in the research and development process.
4. Techniques:
adopting scalable, high-performance algorithms for text processing; adopting Coordinated and Multiple Views; and real-time processing technology.

The study also points out two research directions for document visualization that remain to be developed: evaluation methods for document visualization [62] and theoretical foundations [63].


This overview introduces fundamental concepts of and designs for document visualization, a number of representative methods in the field, and challenges as well as promising directions of future development.

Document visualization is a class of information visualization techniques that transforms textual information such as words, sentences, documents, and their relationships into a visual form, enabling users to better understand textual documents and lessening their mental workload when faced with a substantial quantity of available textual documents. [1]

Compared with text visualization, which aims to visualize information at the text level, document visualization concentrates more on visualizing documents that include attributes and metadata in addition to the core textual content.

Document visualization offers significant advantages in helping people analyze and manage large quantities of textual information. For example, we can intuitively get access to (1) word frequency or distribution; (2) semantic content and repetition; (3) the topic or topics that define document clusters; (4) the core content of a document; (5) similarity among documents; (6) the connections among documents; (7) how content changes over time; and (8) information diffusion or other interesting patterns in social media, as well as improve text searches.

Generally ‘document’ is a textual record or physical form/representation of ‘information’.

The evolving notion of ‘document’ among Jonathan Priest, Otlet, Briet, Schürmeyer, and the other documentalists increasingly emphasized whatever functioned as a document rather than traditional physical forms of documents. [2]

And with the development of digital technology, anything that exists in a digital environment, such as a mail message or a technical report, can be considered a document.

Documents are often minimally structured and may be rich with attributes and metadata, especially when concentrated in a specific application domain.

We may learn from the good practical guidelines to create an effective user interface for an interactive information visualization tool, as propounded by Ben Shneiderman who suggested in a form of mantra that an effective information visualization tool should follow the principle:
Overview first, zoom and filter, then details on demand. [4]

The mantra is accompanied by a task taxonomy for information visualizations that specifies seven
tasks at a high level of abstraction [4]:
• Overview. Gain an overview of the entire collection.
• Zoom. Zoom in on items of interest.
• Filter. Filter out uninteresting items.
• Details-on-demand. Select an item or group and get details when needed.
• Relate. View relationship among items.
• History. Keep a history of actions to support undo, replay, and progressive refinement.
• Extract. Allow extraction of sub-collections and of the query parameters.

We firstly divide document visualization methods into three main categories:
(1) single document visualization that has more emphasis on individual words and actual single document contents;
(2) document collection visualization that has more emphasis on large document collections, themes and concepts across the collection, and how documents relate to one another;
(3) extended document visualization, which often deals with comprehensive tasks, involves other attributes beyond the content of documents, and is usually applied in specific fields, such as social media and search.

In single document visualization, the goal is to quickly understand and absorb core content and text
features. The visualization focuses on words, phrases, semantic relations, and contents.

1. Vocabulary-Based Visualization
Vocabulary is the basic unit of a document. The visualization assists people in understanding words through visual representation of the document vocabulary features, such as word frequency, word distribution, and lexical structure, thereby providing a general idea of contents and features in a document.

Tag Clouds [6,7] and Wordle [8,9] are representative methods mainly visualizing word frequency. They are widely used in the news media and personal home pages. They provide layouts of raw tokens, colored, and sized by the corresponding word frequency within a single document. We may know the main research areas/content discussed in the text by the compact visual form of words.
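A minimal sketch of the core computation behind such word clouds (not the actual Tag Clouds or Wordle layout algorithms, which are considerably more involved): count word frequencies and map them linearly to font sizes. The stop-word list, size range, and file name are illustrative assumptions.

import re
from collections import Counter

def tag_cloud_sizes(text, max_words=50, min_size=10, max_size=48):
    """Count word frequencies and map them to font sizes for a simple tag cloud."""
    stopwords = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stopwords]
    counts = Counter(words).most_common(max_words)
    top = counts[0][1] if counts else 1
    # Linear scaling: the most frequent word gets max_size, others proportionally less.
    return {w: min_size + (max_size - min_size) * c / top for w, c in counts}

sizes = tag_cloud_sizes(open("document.txt").read())   # "document.txt" is a placeholder input
for word, size in sorted(sizes.items(), key=lambda x: -x[1])[:10]:
    print(f"{word}\t{size:.1f}pt")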

Recently, some other methods have been proposed, extending the tag/word cloud, such as
parallel tag clouds (PTC), [10] ManiWordle, [11] context preserving dynamic word cloud,[12] visualization of internet discussion with extruded word clouds. [13]

Other examples: TextArc [14], DocuBurst [16].

2. Visualization Based on Semantic Structure

Visualization based on semantic structure usually uses entities and their relationships to reveal the semantic content.

Semantic Graphs [19] is a visualization based on the semantic representation of a document in the form of a semantic graph. Firstly, it extracts subject–verb–object for each sentence by the Penn Treebank parse tree. Then, it links the triplets to their corresponding entity, which needs to resolve pronominal anaphors as well as to attach the associate WordNet synset. Thus, the document is summarized with the semantic graph and the list of extracted triplets.
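The graph-construction step can be illustrated with a toy sketch: given subject-verb-object triplets that have already been extracted (the paper obtains them from Penn Treebank parse trees plus anaphora resolution; here they are simply hard-coded, invented examples), link subjects and objects into a directed graph with verbs as edge labels. The networkx library is an assumed choice, not part of the paper.

import networkx as nx

# Hypothetical triplets that a parser + anaphora resolution step might produce.
triplets = [
    ("company", "acquired", "startup"),
    ("startup", "develops", "software"),
    ("company", "hired", "engineers"),
]

G = nx.DiGraph()
for subj, verb, obj in triplets:
    # Entities become nodes; the verb labels the edge between them.
    G.add_edge(subj, obj, label=verb)

for u, v, data in G.edges(data=True):
    print(f"{u} --{data['label']}--> {v}")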

3. Visualization Based on Document Content
Visualization based on document content is not only to search for specific words but also to obtain the characteristics and relations of the contents in the document.

The WordTree visualization provides the representation of both word frequency and context. Size is used to represent frequency of the term or phrase. The root of the tree is a user-selected word or phrase, and the branches represent the contexts in which the word or phrase is used in the document. Users can click on a branch, choose a different search term or re-center the tree. [23]
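A rough sketch of the data structure underlying such a word tree, under the assumption of a plain whitespace tokenizer: every occurrence of the selected root word contributes one branch made of the words that follow it, and the branch frequency would determine display size. This is an illustration, not the WordTree implementation.

from collections import defaultdict

def word_tree_branches(text, root, depth=3):
    """Collect the word sequences that follow each occurrence of `root`."""
    tokens = text.lower().split()
    branches = defaultdict(int)
    for i, tok in enumerate(tokens):
        if tok == root:
            context = tuple(tokens[i + 1:i + 1 + depth])
            branches[context] += 1  # frequency of this continuation
    return branches

text = "the cat sat on the mat and the cat ran after the dog"
for context, freq in word_tree_branches(text, "cat").items():
    print(freq, "cat " + " ".join(context))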

Martin Wattenberg’s Arc Diagrams [25] is a visualization method that focuses on showing complex patterns of repetition. It is suited to the analysis of highly structured data like musical compositions and less well-structured data like a web page. Repeated subsequences are identified and connected by semicircular arcs. Height of the arcs represents the distance between the subsequences; and thickness of the arcs represents the length of the subsequences.

Document Collection Visualization

Document collection visualization usually intends to reveal the topic or topics that define document clusters, the similarities and differences among documents, and how contents change over time.

1. Visualization of Document Themes

The main goal is to discover one or more specific topics and to reflect the relationships among various topics.

It may be used to find hot disciplines, evolutions, and trends.

Methods such as ThemeScapes, [26] INSPIRE’s ThemeView, and The Galaxy, [27] all developed by the Pacific Northwest National Laboratory, place less emphasis on the time factor and focus more on characteristics of the document themes at specific points in time.

ThemeView uses a 3D terrain map display to represent different themes. The height of a mountain represents the theme’s strength, and the distance between two mountains represents the similarity between the two themes. Keywords are used to distinguish each mountain. [27]

The Galaxy visualization uses a similar approach that themes are visualized as 2D clouds of document points-stars in a theme galaxy (Figure 8(b)). [27]

There are other representations for visualizing documents and topics as nodes in a node-link graph. TopicNets is a web-based system for visual and interactive analysis of large sets of documents using statistical topic models. [28] The main view is a document topic graph which can allow aggregate nodes. The time dimension is represented as a separate visualization, with documents placed chronologically around a broken circle, and connected to related topic nodes which are placed inside the circle.

The methods, such as ThemeRiver [29] and Topic Island, [30] have greater emphasis on the time factor, focusing more on visualizing thematic variations over time within a collection of documents.

ThemeRiver is in the form of axes, with the X-axis representing time and the Y-axis representing different themes. The ‘river’ flows from left to right through time, changing width to portray changes in theme strength of corresponding documents. Rivers of different colors represent different themes, and the width of river (i.e., narrow or wide) indicates the strength (decreasing or increasing) of an individual topic in the associated documents. [29]
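A ThemeRiver-style display can be approximated with matplotlib's stackplot, as in the sketch below; the monthly theme strengths are synthetic numbers standing in for per-period document counts, and this is only an approximation of the look, not the ThemeRiver system itself.

import matplotlib.pyplot as plt
import numpy as np

months = np.arange(12)                      # X-axis: time
themes = {                                  # synthetic theme strengths per month
    "retrieval":     [5, 6, 8, 9, 7, 6, 5, 4, 4, 5, 6, 7],
    "bibliometrics": [2, 2, 3, 5, 8, 9, 7, 6, 5, 4, 3, 3],
    "visualization": [1, 1, 2, 2, 3, 5, 8, 9, 8, 6, 4, 2],
}

fig, ax = plt.subplots()
# baseline="sym" centers the stream around the X-axis, giving the "river" look.
ax.stackplot(months, themes.values(), labels=themes.keys(), baseline="sym")
ax.set_xlabel("time")
ax.legend(loc="upper left")
plt.show()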

2. Visualization of Document Core Content

Visualization of document core content mainly intends to give an overview of a collection of documents without reading them entirely.

Document Cards [35] visualizes large document collections, such as paper collections and news reports, which contain both texts and images to describe facts, methods, or stories. It represents the document’s key content as a mixture of images and important terms, similar to cards in a top trumps game. [35]

The pipeline for creating Document Cards is as follows: firstly, extract the text from the original document, and use a text mining approach to extract the key terms; then go to the phases of image extraction, including image processing and image packing; finally layout the extracted key terms and images to generate the corresponding document cards.
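The key-term stage of such a pipeline can be approximated with TF-IDF over the collection; the sketch below (using scikit-learn and a tiny invented corpus) takes each document's highest-weighted terms as its card keywords, which is a simplification of the text mining described in the paper.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Topic models summarize large document collections.",
    "Self-organizing maps project documents onto a 2D grid.",
    "Tag clouds display word frequency with font size.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

for i in range(len(docs)):
    row = tfidf[i].toarray().ravel()
    top = row.argsort()[::-1][:3]          # three highest-weighted terms per document
    print(f"card {i}:", [terms[j] for j in top])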

3. Visualization of Changes over Different Versions

Visualization of changes over different versions is used to visualize differences among multiple document versions that are generated over time.

History Flow [36] is designed to show changes between multiple document versions on Wikipedia. It can visualize the process of content changes and the corresponding authors who make the amendments. It also reveals some complex patterns of cooperation and confliction, such as vandalism and repair, anonymity versus named authorship, negotiation, and content stability.

Software visualization [37–39] focuses on visualizing the software development. SeeSoft, [40] Augur, [41] and Advizor [42] are visualizations for code documents. Xia gives visual insight into version control activities, like architectural and coding differences between two software versions. [43] Beagle visualizes changes among different released versions. [44] Spectrograph shows the time and location where changes happen in the system. [45]

4. Visualization of Document Relationships

With gradual increases in document quantity, the concepts and entities within documents become larger and larger, making the analyst’s task of evaluation and sense-making more difficult. Thus, it is quite meaningful to visualize connections among documents. The visualization focuses on the correlation among documents, like the connections among entities across different documents.

Jigsaw [46] is an interactive visualization for document exploration and sense-making, and it supports the analysis of relationships among documents. It visually shows connections between entities in the documents; where entities could be people, places, dates, organizations, and so on. It is suitable to documents describing a set of observations or facts, like news stories and case reports. It provides multiple views and each view provides a different perspective.

There are other methods for visualizing relations among multiple facets. ContexTour [47] presents the relations among conferences, authors, and topics in paper collections. FacetAtlas [48] shows relations among causes, symptoms, treatments, and diagnoses in Google Health documents. PivotPaths [49] visually explores relations of authors, keywords, and citations in academic publications.

5. Visualization of Document Similarity

In many cases of document collection visualizations, the goal is to place similar documents close to each other and dissimilar ones far apart.

The self-organizing map (SOM) [50] is a nonlinear projection method. It expresses complex, nonlinear relationships between high dimensional data items into simple geometric relationships on a 2D display.

When applied to information retrieval, it usually uses map displays. [51] Differently colored areas represent different concepts in the documents; the size of an area indicates its relative importance in the collection; neighboring regions show commonalities in concepts; and dots in the regions can represent documents. Further details can be found in Xia Lin’s map display [52] and WEBSOM. [53]
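A compact, from-scratch SOM sketch in NumPy (not the WEBSOM implementation): documents are represented by term vectors (random numbers here, standing in for real features), and each document is mapped to its best-matching unit on a small 2D grid. Grid size, learning rate, and decay schedule are illustrative choices.

import numpy as np

def train_som(data, grid=(6, 6), iters=2000, lr0=0.5, sigma0=2.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h, w, data.shape[1]))
    coords = np.dstack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))
    for t in range(iters):
        lr = lr0 * (1 - t / iters)                 # decaying learning rate
        sigma = sigma0 * (1 - t / iters) + 0.5     # decaying neighborhood radius
        x = data[rng.integers(len(data))]
        dists = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(dists.argmin(), dists.shape)   # best-matching unit
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
        influence = np.exp(-(grid_dist ** 2) / (2 * sigma ** 2))[..., None]
        weights += lr * influence * (x - weights)  # pull neighborhood toward x
    return weights

def map_documents(weights, data):
    return [np.unravel_index(np.linalg.norm(weights - x, axis=2).argmin(),
                             weights.shape[:2]) for x in data]

docs = np.random.default_rng(1).random((20, 50))   # 20 documents, 50 term features
weights = train_som(docs)
print(map_documents(weights, docs))                # grid cell of each document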

With the rise of social media (a textual medium), text streams, such as Twitter posts, are being generated in volumes that grow every day. A large body of research has appeared in recent years. Those works have different focuses and always involve multiple targets, such as dealing with the statistical analysis and presentation of topics or terms, focusing on the emergence of topic events, [33] and visualizing the text messages themselves. [54]

Christian Rohrdantz [55] provides an overview of real-time visualization of streaming text data.

STREAMIT [56] presents a similar visual representation of text streams which applies to news documents.

Whisper [57] fulfills the requirement for tracing information diffusion processes in social media, in a real-time manner.

Search visualization visualizes the results of search operations. A relatively early approach is TileBars, [58] which aims to minimize the time and effort needed to decide which documents to view in detail.

Susan Havre [59] introduces a graphical method called Sparkler for visually presenting and exploring the results of multiple queries simultaneously.

RankSpiral addresses the problem of how to enable users to visually explore and compare large sets of documents that have been retrieved by different search engines or queries. [60]

We have mainly considered the visualization objects and tasks when classifying document visualization methods. Our classification is considered more acceptable than other classifications (e.g., by representation: pixel-based, map-based, tree-based graphs, node-link diagrams, circle graphs, etc.), since visualization is usually task dependent, and users commonly begin with data and tasks. In fact, each method may belong to a different category even under the same classification criteria; we classify each method according to its key visualization focus (the visualization objects and tasks).

In Table 1, we summarize and compare those methods mainly from four aspects to give readers a brief view.

• Characteristics visualized. The characteristics of a document visualized by the method, such as word frequency, semantic relations, content, changes, or connections among documents.

• Principles satisfied. The design principles satisfied, i.e., which of the seven tasks noted in the Document and Document Design section are supported: 1) Overview; 2) Zoom; 3) Filter; 4) Details-on-demand; 5) Relate; 6) History; 7) Extract.

• Requirements for a document. Document types suitable for the visualization method, i.e., whether the visualization method has special requirements for a document, such as document content or structure.

• Main features. The visualization method’s notable features, especially its versatility and interactivity.

Despite this, document visualization shares the same pipeline: get the data (a document or documents), transform it into vectors, then run algorithms based on the tasks of interest (i.e., similarity, search, clustering) and generate the visualizations.


Document visualization techniques combine human wisdom and computer graphics, allowing users to efficiently and intuitively browse, explore, and understand the increasing quantity of documents.

1. Extension: Existing methods can be extended to suit large-scale document collections.
2. Versatility: It is important to design relatively general visualization models for different tasks within this field, since existing methods usually have a narrow scope of application due to their specialized focus.
3. Interactivity: It is important to design a more intuitive man–machine interface to improve the user’s interaction experience. It is also crucial to find opportunities for users to participate in the research and development process, especially during the testing period.
4. Techniques:
• Algorithms. Develop and adopt scalable, high-performance algorithms for text processing, such as text summary and clustering.
• Parallel processing technology. With the adoption and popularity of Coordinated and Multiple Views (CMV), a visualization system usually includes multiple views.
• Real-time processing technology.

1. Evaluation: Many document visualization methods, and even information visualization methods in general, lack a quantitative measurement that can indicate overall quality, novelty, uncertainty, and other evaluative metrics. More recently, there are more and more publications that reflect upon current practices in visualization evaluation. In fact, the BELIV workshop was created as a venue for researchers and practitioners to ‘explore novel evaluation methods, and to structure the knowledge on evaluation in information visualization around a schema’. [62]

2. Theoretical Foundations: The 2007 Dagstuhl Workshop identified collaborative information visualization along with theory building as major directions for future development. [63]

Friday, December 18, 2015

Kucher, K., & Kerren, A. (2015). Text Visualization Techniques: Taxonomy, Visual Survey, and Community Insights. In 8th IEEE Pacific Visualization Symposium (PacificVis' 15), Hangzhou, China (pp. 117-121). IEEE Computer Society.

In recent years, the availability of large amounts of diverse text data and the adoption of text processing algorithms have increased researchers' interest in text visualization and visual text analytics. This study presents an interactive visual survey of text visualization techniques. Using the data collected for the survey, it analyzes the current state of text visualization, compares the analysis and visualization techniques used in the studies, and analyzes information about the researchers in the field, so as to support searching for related work, exploring the subfield, and gaining insight into research trends.

Following earlier work, the study builds a taxonomy of text visualization techniques along the dimensions of analytic tasks, visualization tasks, data domain, data source, data properties, visualization dimensionality, visualization representation, and visualization alignment.


Analytic tasks are the main analysis goals users expect to achieve when employing a text visualization technique. The categories include:
1. Text Summarization / Topic Analysis / Entity Extraction
2. Discourse Analysis: linguistic analysis of the flow of a text or conversation transcript.
3. Sentiment Analysis
4. Event Analysis
5. Trend Analysis / Pattern Analysis
6. Lexical / Syntactical Analysis
7. Relation / Connection Analysis
8. Translation / Text Alignment Analysis

Visualization tasks are the lower-level representation and interaction tasks supported by text visualization techniques, including:

1. Region of Interest (automatic highlighting/suggestion)
2. Clustering / Classification / Categorization
3. Comparison
4. Overview
5. Monitoring
6. Navigation / Exploration
7. Uncertainty Tackling

Data domains include:

1. Online Social Media
2. Communication
3. Patents
4. Reviews / (Medical) Records
5. Literature / Poems
6. Scientific Articles / Papers
7. Editorial Media

Data sources include single documents [33], corpora [25], and streams [19]; special data properties include geospatial [11], time series [14], and network [6] data. Visualization representations include line plots / rivers [9, 18], pixel / area / matrix representations [13, 7, 4], node-link diagrams [32], clouds / galaxies [1, 3], maps [34], text [26], and glyphs / icons [28, 10]. Alignments include radial [35], linear / parallel [8], and metric-dependent [22] layouts.

The study finds that more than half (56%) of the text visualization techniques use topic modeling, that most support corpora as data sources (70%), and that many support time-dependent data (43%). As for visual representation, most techniques are two-dimensional (2-D); only a very small number, fewer than 4% of all entries, use three-dimensional (3-D) representations.

The top five authors in text visualization are Daniel A. Keim (17 entries), Shixia Liu (12 entries), Christian Rohrdantz (9 entries), Daniela Oelke (7 entries), and Huamin Qu (7 entries). After building a collaboration network of researchers from their co-authorship and inspecting its connected components, one finds that most components are small isolated groups; the largest component contains 106 authors, and its two main clusters correspond to the research groups at the University of Konstanz and Microsoft Research Asia, with Daniel A. Keim and Shixia Liu as the cluster centers. These two authors also have the highest betweenness centrality in the network. Although they have no direct collaboration in the data collected for this study, both have collaborated with Dongning Luo and Jing Yang, the two authors who share the third-highest betweenness centrality.


In this paper, we present an interactive visual survey of text visualization techniques that can be used for the purposes of search for related work, introduction to the subfield and gaining insight into research trends.

The interest in text visualization and visual text analytics has been increasing over the last ten years. The reasons for this development are manifold, but the availability of large amounts of heterogeneous text data (caused by the popularity of online social media) and the adoption of text processing algorithms (e.g., for topic modeling) by the InfoVis and Visual Analytics communities are two likely explanations.




Analytic Tasks
these items are critical to the main analysis goals that users expect to achieve when employing a text visualization technique.

1. Text Summarization / Topic Analysis / Entity Extraction

2. Discourse Analysis
the linguistic analysis of the flow of text or conversation transcript.

3. Sentiment Analysis
for techniques related to the analysis of sentiment, opinion, and affection.

4. Event Analysis
deal with the extraction of events from the text data or involve visualization of text in some different manner

5. Trend Analysis / Pattern Analysis
both automated trend analysis and manual investigation directed at discovering patterns in the textual data.

6. Lexical / Syntactical Analysis

7. Relation / Connection Analysis

8. Translation / Text Alignment Analysis

Visualization Tasks
lower-level representation and interaction tasks that are supported by the text visualization techniques.

1. Region of Interest
the automatic highlighting/suggestion of data items/regions that could be of interest to the user for more detailed investigation

2. Clustering / Classification / Categorization

3. Comparison

4. Overview
both techniques that provide “the big picture” by displaying a significant portion of the data set as well as techniques which use special aggregated representations to provide overview while reducing the visual complexity

5. Monitoring

6. Navigation / Exploration

7. Uncertainty Tackling


Domain

1. Online Social media

2. Communication

3. Patents

4. Reviews / (Medical) Records

5. Literature / Poems

6. Scientific Articles / Papers

7. Editorial Media

Data sources include the following self-evident items: Document [33], Corpora [25], and Streams [19].

The special data properties include Geospatial [11], Timeseries [14], and Networks [6].

Representation includes the following items: Line Plot / River [9, 18], Pixel / Area / Matrix [13, 7, 4], Node-Link [32], Clouds / Galaxies [1, 3], Maps [34], Text [26], and Glyph / Icon [28, 10].

Alignment, i.e., layout, includes Radial [35], Linear / Parallel [8], and Metric-dependent [22].

As displayed in the table, our proposed taxonomy includes most of the categories except for two: we believe that the underlying data representation (e.g., bag-of-words vs. language model [30] or whole text vs. partial text [24]) is more relevant to the underlying computational methods than to observable visualization techniques.

And the same naturally holds for data processing methods (e.g., the specification of involved MDS methods [2]) that are partially covered by other categories in our taxonomy, for instance, the analytic task of topic analysis implies the usage of corresponding computational methods.

Using the data collected for the survey, we have been able to analyze the general state of the text visualization field, to compare the usage of various analysis and visualization techniques (with regard to our taxonomy), and to analyze the information about researchers in this field.

According to our current set of entries, the trend for rapid increase of text visualization techniques started around 2007.

With regard to category statistics (cf. Fig. 4), there is an obvious interest for tasks related to topic modeling (56% of all entries).

The majority of the techniques support corpora as data sources (70% of all entries), and a lot of them support time-dependent data (43% of all entries).

Another result—which is probably expected—is that only less than 4% of all entries use 3-dimensional visual representations.

We have also taken a look at the authorship statistics for the current data set. The top five authors with regard to number of techniques are Daniel A. Keim (17 entries), Shixia Liu (12 entries), Christian Rohrdantz (9 entries), Daniela Oelke (7 entries), and Huamin Qu (7 entries).

As seen in Fig. 5, the majority of author nodes are included into isolated connected components of small sizes (less than 10 nodes) while there is a big connected component with 106 nodes present in the graph.

The two major clusters in that component represent the research groups from the University of Konstanz and Microsoft Research Asia with Daniel A. Keim and Shixia Liu as cluster center nodes.

Shixia Liu and Daniel A. Keim happen to have the 1st and the 2nd largest betweenness values in the graph, respectively. While these two researchers have no direct collaboration with regard to our data set, they both have collaborated with Dongning Luo and Jing Yang who both share the 3rd largest betweenness value.
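The kind of co-authorship analysis described above (connected components and betweenness centrality) can be reproduced with networkx, as in the sketch below; the entry/author list is a made-up stand-in for the survey's actual data set.

import networkx as nx
from itertools import combinations

# Hypothetical entries: each technique contributes a clique among its authors.
entries = [
    ["Author A", "Author B", "Author C"],
    ["Author B", "Author D"],
    ["Author E", "Author F"],
]

G = nx.Graph()
for authors in entries:
    G.add_edges_from(combinations(authors, 2))

components = sorted(nx.connected_components(G), key=len, reverse=True)
print("largest component size:", len(components[0]))

betweenness = nx.betweenness_centrality(G)
print(sorted(betweenness.items(), key=lambda x: -x[1])[:3])   # top-3 brokers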

Tuesday, April 21, 2015

Tseng, Y.-H. and Tsay, M.-Y. (2013) Journal clustering of library and information science for subfield delineation using the bibliometric analysis toolkit: CATAR. Scientometrics, 95, 503-528. doi: 10.1007/s11192-013-0964-1.

In recent decades, many scientometric analysis techniques have been developed, including the various similarity measures needed for clustering bibliographic data, such as co-citation, bibliographic coupling, and co-word analysis; for a comparative analysis of these techniques, see Yan and Ding (2012). Many software tools that can be freely downloaded from the web implement and package these techniques for scientometric analysis, such as CiteSpace (Chen 2006, Chen et al. 2010), Sci2 Tool (Sci2 Team 2009), VOSviewer (Van Eck and Waltman 2010), BibExcel (Persson 2009), and Sitkis (Schildt and Mattsson 2006); for an analysis of these tools, see Cobo et al. (2011). This study has two parts: it presents CATAR, a scientometric analysis toolkit that includes a series of clustering and mapping techniques based on bibliometric information, and it applies the toolkit to the field of library and information science (LIS), aiming to use the journal clustering results to identify and analyze subfields and to suggest a set of LIS journals suitable for research evaluation.

Åström (2002) concluded from a study visualizing domain concepts that the selection of journals does affect how a research field is perceived and defined; that is, the delineation of a research field is closely tied to journal selection. Many studies have delineated subfields of library and information science, and most of them refer to the ISI JCR subject category most relevant to LIS, IS&LS (Information Science and Library Science). The IS&LS category does not contain only LIS journals; it covers two closely related fields, information science and library science, a scope that differs slightly from LIS. According to Leydesdorff (2008), the JCR subject classification assigns journals based on criteria such as journal titles and citation patterns, but the resulting classification does not agree well with the one obtained from the principal components of the network generated from the citation data in the database itself. Therefore, most subfield delineation studies select the journals used as analysis data manually and do not include all journals under the IS&LS category.

Techniques commonly used for subfield delineation include co-citation analysis for comparing pairs of items, agglomerative hierarchical clustering (AHC) to group items into a dendrogram, and multi-dimensional scaling (MDS) to produce two- or three-dimensional visual maps. Some important studies are as follows. Åström (2002) selected 1,135 articles published between 1998 and 2000 in important LIS journals, used BibExcel to perform author co-citation and keyword co-occurrence analysis, and produced MDS maps; the co-citation of 52 highly cited authors yielded three clusters, "hard" information retrieval, "soft" information retrieval, and bibliometrics, while the 47 more frequent keywords fell into library science (LS), information retrieval (IR), and bibliometrics. Åström (2002) suggested that library science did not appear in the author co-citation analysis possibly because of the publication channels of library science research: if the cited materials, such as books or regional journals, do not appear in the JCR, library science authors cannot show up in citation-based rankings. Åström (2007) performed document co-citation analysis on 13,605 articles from 21 LIS-related journals selected from the 55 journals in the JCR 2003 subject category, and found that across three periods between 1990 and 2004, LIS could be divided into two stable subfields, informetrics and information seeking and retrieval, while with the spread of the World Wide Web, webometrics became a major research topic in both subfields. Jassen et al. (2006) applied a series of full-text analysis techniques together with MDS and AHC to 938 articles published from 2002 to 2004 in five LIS-related journals and divided the articles into six clusters: two related to bibliometrics, one on IR, one containing general topics, and two smaller but increasingly important clusters, webometrics and patent analysis. Moya-Anegón et al. (2006) selected 17 journals from 24 influential journals, excluding those applying information science (IS) to specific techniques or knowledge domains (e.g., medicine, geography, telecommunications), and, based on the references cited by the 17 journals, performed co-citation analysis of the 77 most cited authors and the 73 most cited journals; the mapping techniques included MDS, AHC, and self-organizing maps. The author co-citation analysis produced six subfields: scientometrics, citation analysis, bibliometrics, "soft" (cognitively oriented) information retrieval, "hard" (algorithmically oriented) information retrieval, and communication theory. The journal co-citation analysis produced four clusters: IS, LS, science studies, and management. The science studies cluster in the journal co-citation analysis roughly corresponds to scientometrics, citation analysis, and bibliometrics in the author co-citation analysis, and IS corresponds to "soft" and "hard" information retrieval; for the same reason noted by Åström (2002), LS did not appear in the author co-citation results. Waltman et al. (2011) used JASIST as a seed, selected the journals most frequently co-cited with it, 48 journals including JASIST, performed bibliographic coupling analysis of the journals, and visualized the result with VOSviewer, obtaining three subfields: LS, IS, and scientometrics. Milojevic et al. (2011) used co-word analysis to examine 10,344 articles published from 1998 to 2007 in 16 journals selected according to Nisonger and Davis (2005); they analyzed the co-occurrence of the 100 most frequent title words and grouped them with AHC, resulting in three main clusters: LS, IS, and bibliometrics/scientometrics.

The keyword co-occurrence results of Åström (2002) include an LS subfield, but the map obtained from author co-citation analysis does not. The journal co-citation and author co-citation analyses of Moya-Anegón et al. (2006) also differ slightly: the journal co-citation results contain the LS and management subfields, which the author co-citation results lack, whereas the author co-citation results reveal communication theory, which the journal co-citation results lack. This is generally attributed to authors' citation behavior: most LS authors are not cited often enough to reach the analysis threshold and therefore cannot appear in the author co-citation results of the two studies above.

Ni et al. (2012) took the 61 journals under the JCR IS&LS category, excluded 3 non-English journals, and applied four analyses to the remaining 58 journals: venue-author coupling, journal co-citation analysis, co-word analysis, and journal interlocking. The results were further analyzed with MDS and AHC. The subfields obtained consistently by the four methods include management information systems (MIS), IS, LS, and specialized clusters, and in the MDS maps of all four methods MIS is separated from the other clusters. Ni and Ding (2010) and Ni and Sugimoto (2011) suggested that the LIS-related journals in the JCR should be appropriately reorganized.

The data used in this study (Tseng and Tsay 2013) cover all journals under the Information Science & Library Science (IS&LS) subject category of the Web of Science Journal Citation Reports for 2000-2004 and 2005-2009: 50 journals in the earlier period and 66 in the later period. The analysis procedure follows the general workflow summarized by Börner et al. (2003), with the steps of 1) data collection, 2) text segmentation, 3) similarity computation, 4) multi-stage clustering, 5) cluster labeling, 6) visualization, and 7) facet analysis. The techniques required for these steps have been integrated into the software toolkit CATAR (Content Analysis Toolkit for Academic Research, http://web.ntnu.edu.tw/~samtseng/CATAR/). When computing relatedness between documents, the study treats each journal as a document and the journals cited by all of its papers as the document's features, and then uses the Dice coefficient (Salton 1989) to compute journal similarity: for two journals X and Y, with R(X) and R(Y) the sets of journals they cite, their similarity is Sim(X, Y) = 2 * |R(X) ∩ R(Y)| / (|R(X)| + |R(Y)|). In other words, journal similarity is computed by bibliographic coupling. The journals are then grouped with complete-linkage hierarchical clustering: each document is first treated as its own cluster, the most similar pair of clusters is merged into a larger cluster, and the process is repeated, where the similarity of two clusters is defined as the smallest document similarity between them; two clusters are merged if their similarity exceeds a preset threshold, until no further merges are possible. In addition, the study uses the Silhouette index (Ahlgren and Jarneving 2008; Rousseeuw 1987; Jassen et al. 2006).
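A small sketch of the similarity and clustering steps just described, with invented reference sets (CATAR itself is not used here). SciPy's complete-linkage routine cuts the dendrogram at a distance threshold rather than merging by a similarity threshold as in the description above, but the grouping logic is equivalent.

from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical journals and the sets of journals their papers cite.
cited = {
    "J1": {"A", "B", "C", "D"},
    "J2": {"B", "C", "D", "E"},
    "J3": {"X", "Y", "Z"},
    "J4": {"Y", "Z", "W"},
}
names = list(cited)

def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))

# Condensed distance list (1 - Dice similarity) in the pair order SciPy expects.
dist = [1 - dice(cited[a], cited[b]) for a, b in combinations(names, 2)]
tree = linkage(dist, method="complete")
labels = fcluster(tree, t=0.7, criterion="distance")
print(dict(zip(names, labels)))   # cluster label for each journal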

The data include the journals under the JCR IS&LS category in two periods, 2000-2004 and 2005-2009; the earlier period contains 50 journals and 9,546 articles, the later period 66 journals and 11,471 articles. The dendrograms of the clustering results and the MDS maps show that in both periods the IS&LS journals form clusters for IR, MIS, scientometrics, academic libraries, medical libraries, and collection development, plus two smaller clusters that appear only in the later period, open access and regional libraries. The journals in the MIS cluster are separated from the other IS&LS journals in their intellectual base, indicating that these journals have rather distinctive citation patterns. This study, which analyzes the intellectual base of journals through bibliographic coupling, thus reaches the same conclusion about the separation of the MIS cluster as Ni et al. (2012) did with different methods, namely journal co-citation analysis, journal interlocking, terminology usage, and co-authorship; this also supports the view of many studies of the cognitive structure of LIS that MIS-related journals should not be grouped with the other journals under the single ISI category IS&LS and need to be excluded from such analyses (Larivière et al. 2012). In addition, the study analyzes cluster characteristics with a diversity index, revealing that some subfields have a regional character.

Wednesday, April 15, 2015

Moya-Anegón, F. de, Vargas-Quesada, B., Chinchilla-Rodríguez, Z., Corera-Álvarez, E., Munoz-Fernández, F.J., & Herrero-Solana, V. (2007). Visualizing the marrow of science. Journal of the American Society for Information Science and Technology, 58(14), 2167–2179.

It is generally believed that representing the relationships among domains as a graph, and making it possible to contemplate these relationships, provides a great deal of information and helps both newcomers and experts understand and analyze a domain, so the demand for such methods and tools has been growing. Most previous studies used journals as the unit of analysis to produce science maps covering all research fields. For example, Leydesdorff (2004a, 2004b) classified the science covered by JCR 2001 using the graph-analytical algorithm of biconnected components. Boyack, Klavans, and Börner (2005) applied eight different journal similarity measures to 7,121 SCI and SSCI journals and used VxOrd to generate a science map. Samoylenko, Chao, Liu, and Chen (2006) constructed minimum spanning trees of scientific journals using SCI data from 1994 to 2001. This study proposes a method for mapping ISI (Institute of Scientific Information) categories into a scientogram: links between categories are constructed from their cocitation information, unimportant links are pruned with Pathfinder Networks (PFNET), the layout of the nodes is determined with the Kamada-Kawai algorithm, and the structure is finally corroborated with factor analysis. Like the authors' earlier work, this study maps science through the cocitation of categories: as units of analysis, categories are sufficiently explicit as representations and, compared with smaller units, are more informative and user friendly for nonexpert users. Moya-Anegón et al. (2004) visualized the Spanish scientific domain, and Moya-Anegón et al. (2005) further used scientograms to compare the scientific domains of England, France, and Spain. This study follows the knowledge domain mapping workflow proposed by Börner, Chen, and Boyack (2003). The data consist of 7,585 ISI journals; there are 219 ISI categories in total, and after excluding Multidisciplinary Sciences, 218 categories are used.

Category similarity is computed from cocitation as CM(ij) = Cc(ij) + Cc(ij) / sqrt(c(i) * c(j)), where Cc(ij) is the number of cocitations of categories i and j, and c(i) and c(j) are the numbers of citations received by categories i and j. The network is then drawn with Pathfinder Networks and the Kamada-Kawai algorithm; after Pathfinder pruning, nodes with more links occupy more important positions. Pathfinder Networks are a topology-oriented method that complements the cluster-oriented factor analysis: factor analysis identifies, delimits, and names the thematic areas shown in the scientogram, while the Pathfinder network makes the thematic areas more visible, groups categories into bunches, and shows the paths connecting the prominent categories as well as the overall topological structure. In total, 35 factors were identified, of which 16 passed the scree test. The categories on the scientogram fall into three groups: medical and earth sciences, basic and experimental sciences, and the social sciences.

This study proposes a new methodology that allows for the generation of scientograms of major scientific domains, constructed on the basis of cocitation of Institute of Scientific Information categories, and pruned using PathfinderNetwork, with a layout determined by algorithms of the spring-embedder type (Kamada–Kawai), then corroborated structurally by factor analysis.

We present the complete scientogram of the world for the Year 2002.

This need arises from the general conviction that an image or graphic representation of a domain favors and facilitates its comprehension and analysis, regardless of who is on the receiving end of the depiction and whether a newcomer or an expert.

Science maps can be very useful for navigating around in scientific literature and for the representation of its spatial relations (Garfield, 1986). They are optimal means of representing the spatial distribution of the areas of research while also offering additional information through the possibility of contemplating these relationships (Small & Garfield, 1985).

From a general viewpoint, science maps reflect the relationships between and among disciplines; but the positioning of their tags clues us into semantic connections while also serving as an index to comprehend why certain nodes or fields are connected with others.

Moreover, these large-scale maps of science show which special fields are most productively involved in research—providing a glimpse of changes in the panorama—and which particular individuals, publications, institutions, regions, or countries are the most prominent ones (Garfield, 1994).

It is a tool in that it allows the generation of maps, and a method in that it facilitates the analysis of domains, by showing the structure and relations of the inherent elements represented. In a nutshell, scientography is a holistic tool for expressing the discourse of the scientific community it aspires to represent, reflecting the intellectual consensus of researchers on the basis of their own citations of scientific literature.

In Moya-Anegón et al. (2004), we ventured forth with a historic evolution of scientific maps from their origin to the present, and proposed ISI-JCR category cocitation for the representation of major scientific domains. Its utility was demonstrated by a visualization of the scientific domain of geographical Spain for the Year 2000.

Since then, other works related with the visualization of great scientific domains have appeared; however, all use journals as the unit of analysis, with the exception of a study based on the cocitation of categories (Moya-Anegón et al., 2005), comparatively focusing on three geographic domains (England, France, and Spain).

In contrast, Leydesdorff (2004a, 2004b) classified world science using the graph-analytical algorithm of biconnected components in combination with JCR 2001.

Boyack, Klavans, and Börner (2005) applied eight alternative measures of journal similarity to a dataset of 7,121 journals covering over 1 million documents in the combined Science Citation and Social Science Citation Indexes, to show the first global map of science using the force-directed graph layout tool VxOrd.

Samoylenko, Chao, Liu, and Chen (2006) proposed an approach through the construction of minimum spanning trees of scientific journals, using the Science Citation Index from 1994 to 2001.

In processing and depicting the scientific structure of great domains, we further developed a methodology that follows the flow of knowledge domains and their mapping as proposed by Börner, Chen, and Boyack (2003).

Because ISI assigns each journal to one or more subject categories, to designate a subject matter (i.e., ISI category) for each document, we also downloaded the Journal Citation Report (JCR; Thomson Corporation, 2005a), in both its Science and Social Sciences editions, for 2002.

The downloaded records were exported to a relational database that reflects the structured information of the documents. This new repository contained nearly 1 million (N = 901,493) source documents: articles, biographical items, book reviews, corrections, editorial materials, letters, meeting abstracts, news items, and reviews that had been published in 7,585 ISI journals (N = 5,876 + 1,709). These were classified in a total of 219 categories, altogether citing 25,682,754 published documents.

As informational units, they are, in themselves, sufficiently explicit to be used in the representation of all disciplines that make up science in general. These categories, in combination with the adequate techniques for the reduction of space and the representation of the information to construct scientograms of science or of major scientific domains, prove much more informative and user friendly for quick comprehension and handling by nonexpert users than those obtained by the cocitation of smaller units of cocitation.

For these reasons, we used the 219 categories of the JCR 2002 as units of measure, with the exception of “Multidisciplinary Sciences.” ... The maximum number of categories with which we worked, then, was 218.

In light of our previous experience (Moya-Anegón et al., 2004, 2005), we use cocitation as the similarity measure to quantify the relationship existing between each one of the JCR categories.

Therefore, after a number of trials, we arrived at the conclusion that using tools of Network Analysis, the best visualizations are those obtained through raw data cocitation as the unit of measure. Yet, it also was necessary to reduce the number of coincident cocitations to enhance pruning algorithm yield. Therefore, to those raw data values we added the standardized cocitation value. In this way, we could work with raw data cocitation while also differentiating the similarity values between categories with equal cocitation frequencies. The key was a simple modification of the equation for the standardization of the degree of citation proposed by Salton and Bergmark:

CM(ij) = Cc(ij) + Cc(ij) / sqrt(c(i) * c(j))

where CM is cocitation measure, Cc is cocitation frequency, c is citation, and i and j are categories.
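Assuming the reconstructed formula above (raw cocitation plus its normalized value), the measure can be computed for a whole category-by-category cocitation matrix as in this sketch; the tiny matrix and citation counts are illustrative only.

import numpy as np

def cocitation_measure(Cc, c):
    """CM(ij) = Cc(ij) + Cc(ij) / sqrt(c(i) * c(j)), per the reconstruction above."""
    norm = np.sqrt(np.outer(c, c))
    return Cc + Cc / norm

Cc = np.array([[0, 30, 5],
               [30, 0, 12],
               [5, 12, 0]], dtype=float)   # cocitation counts between 3 categories
c = np.array([200.0, 150.0, 80.0])         # citations received by each category
print(cocitation_measure(Cc, c))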

Over the history of the visualization of scientific information, very different techniques have been used to reduce n-dimensional space. Either alone or in conjunction with others, the most common are multidimensional scaling, clustering, factor analysis, self-organizing maps, and PathfinderNetworks (PFNET).

In our opinion, PFNET with pruning parameters r = ∞, and q = n − 1 is the prime option for eliminating less significant relationships while preserving and highlighting the most essential ones, and capturing the underlying intellectual structure in an economical way.

Although PFNET has been used in the fields of Bibliometrics, Informetrics, and Scientometrics since 1990 (Fowler & Dearhold, 1990), its introduction in citation was due to the hand of Chen (1998, 1999), who introduced a new form of organizing, visualizing, and accessing information. The end effect is the pruning of all paths except those with the single highest (or tied highest) cocitation counts between categories (White, 2001).

The spring embedder type is most widely used in the area of documentation, and specifically in domain visualization. Spring embedders begin by assigning coordinates to the nodes in such a way that the final graph will be pleasing to the eye (Eades, 1984). Two major extensions to the algorithm proposed by Eades (1984) have been developed by Kamada and Kawai (1989) and Fruchterman and Reingold (1991).

While Brandenburg, Himsolt, and Rohrer (1995) did not detect any single predominating algorithm, most of the scientific community goes with the Kamada–Kawai algorithm. The reasons upheld are its behavior in the case of local minima, its capacity to minimize differences with respect to theoretical distances in the entire graph, good computation times, and the fact that it subsumes multidimensional scaling when the technique of Kruskal and Wish (1978) is applied.

We can effortlessly see which are the most important nodes in terms of the number of their connections and, in turn, which points act as intermediaries with other lines, as hubs or forking points.

Whereas factor analysis is a clustering-oriented procedure, PFNET is topology oriented. Yet, they are extremely valuable as complements in the detection of the structure of a scientific domain.

Thus, factor analysis is responsible for identifying, delimiting, and denominating the great thematic areas reflected in the scientogram.

Meanwhile, PFNET is in charge of making the subject areas more visible, grouping their categories into bunches, and showing the paths that connect the different prominent categories, and finally, the overall topology of the domain.

Factor analysis identifies 35 factors in the cocitation matrix of 218 × 218 categories of world science 2002. Through the scree test we extracted 16, which we tagged using the previously explained method; these accumulate 70.2% of the variance (Table 1)

The number of categories included in at least one factor is 195. Twenty-three were not included in any factor (Table 2), and 25 belonged to two factors simultaneously (Table 5).

That is, a category or thematic area occupying a central position in the scientogram will have a more general or universal nature in the domain as a consequence of the number of sources it shares with the rest, contributing more to scientific development than those with a less central position.

The more peripheral the situation of a category or subject area, the more exclusive its nature, and the fewer the sources it will appear to share with other categories; accordingly, the lesser its contribution to the development of knowledge through scientific publications.

An intermediary position favors the interconnection of other categories or thematic areas. 

This broad interpretation of our scientograms not only explains the patterns of cocitation that characterize a domain but also foments an intuitive way for specialists and nonexperts to arrive at a practical explanation of the workings of PFNET (Chen & Carr, 1999).

From a macrostructural point of view, we can distinguish three major zones.

In the center is what we could call Medical and Earth Sciences, consisting of Biomedicine, Psychology, Etiology, Animal Biology & Ecology, Health Care & Service, Orthopedics, Earth & Space Science, and Agriculture & Soil Sciences.

To the right, we can see some other basic and experimental sciences: Materials Sciences & Physics, Applied; Engineering; Computer Science & Telecommunications; Nuclear Physics & Particles & Fields; and Chemistry.

To the left is the neighborhood of the social sciences, with Applied Mathematics, Business, Law, and Economy, and Humanities.

On one hand, it offers domain analysts the possibility of seeing the most essential connections between categories of given domain.

On the other hand, it allows us to see how these categories are grouped in major thematic areas, and how they are interrelated in a logical order of explicit sequences.

Tuesday, April 14, 2015

Pudovkin, A.I., & Garfield, E. (2002). Algorithmic procedure for finding semantically related journals. Journal of the American Society for Information Science and Technology, 53(13), 1113–1119.

This study computes a relatedness factor between journals from citations, papers, and references, and uses it to find the journals that are semantically most similar to a target journal. Traditional classification relies on subjective analysis performed for one particular purpose or another; the journal categories in the ISI Journal Citation Reports (JCR), for example, are produced subjectively by heuristic methods. In the JCR approach, once the categories have been established, new journals are assigned one at a time by visually examining their relevant citation data, and as categories grow they are subdivided. In addition, an unpublished algorithm, the Hayne-Coulson algorithm, is used for assigning individual journals; it treats any designated group of journals as one macro-journal and produces combined cited and citing journal data. In most cases this subjective analysis is sufficient, but in some research areas it is considered too crude, is subject to the vagaries of time, and does not let users quickly learn which journals are most closely related. Quantitative methods based on citation indexes have therefore been proposed to address these problems. For each journal, the JCR provides a set of the most closely related journals based on its citation relationships, namely the journals it cites most heavily and the journals that cite it most often. Pudovkin & Garfield (2002) consider this extremely useful and a crude form of classification, but because journals publish different numbers of papers, it provides only a superficial perception of the relatedness between journals. They therefore propose a measure of relatedness between journals: let Ri>j denote the relatedness of journal i to journal j, defined as Ri>j = Hi>j * 10^6 / (Papj * Refi), where Hi>j is the number of citations from journal i to journal j in the current year, and Papj and Refi are the number of papers published by journal j and the total number of references in journal i in that year. Note that under this definition a journal's relatedness to itself may be smaller than its relatedness to some other journals. To make the relatedness between two journals A and B symmetric, the study uses the larger of RA>B and RB>A, that is, RA&Bmax = max(RA>B, RB>A). Using Genetics, a core journal in genetics and heredity, as an example, the results show that this relatedness factor, weighted by journal size, finds related journals better than unweighted citation counts: it identifies journals that are clearly genetics-related but were not assigned to the JCR "Genetics & Heredity" category, and it also reveals journals assigned to that category whose content is less related.

Using citations, papers and references as parameters a relatedness factor (RF) is computed for a series of journals. Sorting these journals by the RF produces a list of journals most closely related to a specified starting journal.

The method appears to select a set of journals that are semantically most similar to the target journal.

Traditional classification relies on subjective analysis which for one reason or another proves inadequate and is subject to the vagaries of time.

Quantitative methods have been proposed for overcoming these problems. This was greatly facilitated with the introduction of citation indexes in the 1960's and the later introduction of the ISI Journal Citation Reports.

JCR reports inter-journal citation frequencies for thousands of journals. .... Journals are assigned to categories by subjective, heuristic methods.

One of the referees asked for a description of the procedures used by ISI in establishing journal categories for JCR. ... This method is “heuristic” in that the categories have been developed by manual methods started over 40 years ago. Once the categories were established, new journals were assigned one at a time. Each decision was based upon a visual examination of all relevant citation data. As categories grew, subdivisions were established. Among other tools used to make individual journal assignments, the Hayne-Coulson algorithm is used. The algorithm has never been published. It treats any designated group of journals as one macrojournal and produces a combined printout of cited and citing journal data.

In many fields these categories are sufficient but in many areas of research these “classifications” are crude and do not permit the user to quickly learn which journals are most closely related.

JCR provides, for each journal, a set of its most closely related journals based on citation relationships. These are the journals it cites most heavily (cited journals) and also the journals which cite it most often (citing journals). These are extremely useful and provide a crude classification, but unfortunately due to the variations in the sizes of journals one only obtains a superficial perception of the relatedness between two or more specific journals.

We have illustrated the procedure using one core journal in the field of genetics and heredity, the well-known Genetics, published by the Genetics Society of America.

Let journal relatedness of two journals, “i” and “j”, be symbolized by Ri>j = Hi>j * 10^6 / (Papj * Refi), where Hi>j is the number of citations in the current year from journal “i” to journal “j” (to papers published in “j” in all years of “j”), and Papj and Refi are the number of papers published and references cited in the j-th and i-th journals in the current year.

If we consider a pair of journals, A and B, there may be two indexes: RA>B and RB>A. These can be very different.

It is noteworthy that the citation relatedness of a journal to itself (that is “self-relatedness”) may be lower than its relatedness to some other journals.

Now it is suggested we use the larger of them, RA&Bmax = max(RA>B, RB>A), which we shall call the relatedness factor (RF).
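The definitions just quoted translate directly into a few lines of Python; the citation figures below are invented purely for illustration.

def relatedness(H_ij, papers_j, refs_i):
    """R(i>j) = H(i>j) * 10^6 / (Pap_j * Ref_i), as defined above."""
    return H_ij * 1e6 / (papers_j * refs_i)

def relatedness_factor(H_ab, H_ba, papers_a, papers_b, refs_a, refs_b):
    """RF = max(R(A>B), R(B>A)), the symmetric relatedness factor."""
    return max(relatedness(H_ab, papers_b, refs_a),
               relatedness(H_ba, papers_a, refs_b))

# Hypothetical figures for two journals A and B in one year.
print(relatedness_factor(H_ab=420, H_ba=310,
                         papers_a=250, papers_b=180,
                         refs_a=9000, refs_b=6500))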

An important feature of the suggested approach is the calculation of SPECIFIC citation relatedness, that is, the new indexes take into consideration the sizes of citing (through the number of references) and cited (through the number of published papers) journals.

The new algorithmic approach enables one to find thematically related journals out of a multitude of journals. ... Weighting citation data by journal size allows identifying journals that are similar in content better than unweighted raw citation data.

In the case of the starting journal Genetics the method identified those journals which are significantly genetic in content, but were not included in the “Genetics & Heredity” category of the JCR. ... Journals included in the “G & H” category are rather heterogeneous in content. Some are highly related to Genetics, while others, as for example journals on medical genetics are poorly related to its content.

JCR has become an established world wide resource but after two or more decades it needs to reexamine its methodology for categorizing journals so as to better serve the needs of the research and library community.

Thursday, April 9, 2015

Rafols, I., & Leydesdorff, L. (2009). Content‐based and algorithmic classifications of journals: Perspectives on the dynamics of scientific communication and indexer effects. Journal of the American Society for Information Science and Technology, 60(9), 1823-1835.

This study compares two content-based journal classifications with two algorithmic journal classifications. The content-based classifications are the ISI Subject Categories and the field/subfield classification (SOOI) of Glänzel and Schubert (2003); the algorithmic classifications are the unfolding community detection method of Blondel et al. (2008) and the random walk matrix decomposition method of Rosvall and Bergstrom (2008). With the content-based classifications a journal can be assigned to several categories at once, whereas the algorithmic classifications maximize the ratio of within-category citations to between-category citations; that is, the journal-to-journal citation data are arranged as a matrix, and after suitable rearrangement of rows and columns, the values near the principal diagonal are large while the values elsewhere are close to zero.

Statistics for the various classifications are shown in Table 1:


Because the content-based methods allow multiple assignments and the algorithmic methods aim at matrix decomposition, two phenomena can be observed in Table 1. 1) In terms of the median number of journals per category, the two content-based classifications have more journals per category than the two algorithmic classifications, which matches the distributions of journals per category shown in Figure 1 at the 0.50 point. Figure 1 also shows that all four classifications follow a log-normal distribution: in each of them, a relatively small number of categories contain a large number of journals while many categories contain only a few. The algorithmic classifications are more skewed than the content-based ones, i.e., this concentration is more pronounced: the top ten categories of the random walk method contain 57% of the journals and those of the unfolding method 50%, whereas for ISI and SOOI the figures are only 15% and 31%.


2) Looking at the distribution of citations, the two content-based classifications have higher total citation counts than the algorithmic ones, but for the random walk and unfolding methods a larger share of citations falls within categories, whereas for ISI and SOOI the citations are mainly distributed between categories.

Next, the similarity between the categories of each classification is compared using the cosine similarity of their citation patterns. The medians for ISI and SOOI are 0.020 and 0.066, much higher than the 0.009 and 0.007 of the random walk and unfolding methods; the reason is again that the content-based methods allow multiple assignments, so the boundaries between categories are fuzzier, whereas the algorithmic methods separate the categories more sharply. The categories of each classification were then drawn as networks according to their similarity. The networks of all four methods show roughly two large groups, one for biomedicine and one for physics and engineering, connected through three groups: chemistry, a geosciences-environmental science-ecology group, and computer science. The social sciences group is somewhat separate in the networks, connected to biomedicine through the behavioral sciences/neuroscience and to physics/engineering through computer science and mathematics. In sum, the different maps of science are similar, but they differ in the density of categories within the groups.
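The cosine comparison of citation patterns can be sketched as follows, where each row of the (invented) matrix is one category's citation counts toward a fixed list of journals.

import numpy as np

# Hypothetical citation patterns: rows = categories, columns = cited journals.
patterns = np.array([
    [120, 30, 0, 5, 0],
    [100, 45, 2, 0, 1],
    [0, 2, 80, 60, 40],
], dtype=float)

norms = np.linalg.norm(patterns, axis=1, keepdims=True)
unit = patterns / norms
cosine = unit @ unit.T          # pairwise cosine similarity between categories
print(np.round(cosine, 3))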

In this study, we test the results of two recently available algorithms for the decomposition of large matrices against two content-based classifications of journals: the ISI Subject Categories and the field/subfield classification of Glänzel and Schubert (2003).

The content-based schemes allow for the attribution of more than a single category to a journal, whereas the algorithms maximize the ratio of within-category citations over between-category citations in the aggregated category-category citation matrix.

At that time, Leydesdorff & Rafols (2009) were deeply involved in testing the ISI Subject Categories of these same journals in terms of their disciplinary organization. Using the JCR of the Science Citation Index (SCI), we found 14 major components using 172 subject categories, and 6,164 journals in 2006. Given our analytical objectives and the well-known differences in citation behaviour within the social sciences (Bensman,2008), we decided to set aside the study of the (220 − 175 = ) 45 subject categories in the social sciences for a future study.

Our findings using the SCI indicated that the ISI Subject Categories can be used for statistical mapping purposes at the global level despite being imprecise in terms of the detailed attribution of journals to the categories.

In this study, we compare the results of these two algorithms with the full set of 220 Subject Categories of the ISI. In addition to these three decompositions, a fourth classification system of journals was proposed by Glänzel and Schubert (2003) and increasingly used for evaluation purposes by the Steungroep Onderwijs and Onderzoek Indicatoren (SOOI) in Leuven, Belgium. These authors originally proposed 12 fields and 60 subfields for the SCI, and three fields and seven subfields for the Social Science Citation Index and the Arts and Humanities Citation Index. Later, one more subfield entitled “multidisciplinary sciences” was added.

Thus, because research topics are, on the one hand, thinly spread outside the core group and, on the other hand, the core groups are interwoven, one cannot expect that the aggregated journal-journal citation matrix matches one-to-one with substantive definitions of categories or that it can be decomposed in a single and unique way in relation to scientific specialties. The choice of an appropriate journal set can be considered as a local optimization problem (Leydesdorff, 2006).

Citation relations among journals are dense in discipline-specific clusters and are otherwise very sparse, to the extent of being virtually non-existent (Leydesdorff & Cozzens, 2003).

The grand matrix of aggregated journal-journal citations is so heavily structured that the mappings and analyses in terms of citation distributions have been amazingly robust despite differences in methodologies (e.g., Leydesdorff, 1987 and 2007; Tijssen, de Leeuw, & van Raan, 1987; Boyack, Klavans, & Börner, 2005; Moya-Anegón et al., 2007; Klavans & Boyack, 2009).

A decomposable matrix is a square matrix such that a rearrangement of rows and columns leaves a set of square sub-matrices on the principal diagonal and zeros everywhere else.

In the case of a nearly decomposable matrix, some zeros are replaced by relatively small nonzero numbers (Simon & Ando, 1961; Ando & Fisher, 1963). Near-decomposability is a general property of complex and evolving systems (Simon, 1973 and 2002).
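As a toy illustration (not taken from the paper), a nearly decomposable citation matrix for four journals {A, B, C, D} looks as follows after the rows and columns have been ordered so that {A, B} and {C, D} form clusters; the large values sit in the square sub-matrices on the principal diagonal and only small values remain elsewhere:

\[
M = \begin{pmatrix}
120 & 95 & 2 & 0\\
88 & 140 & 0 & 1\\
1 & 0 & 110 & 76\\
0 & 3 & 82 & 130
\end{pmatrix}
\]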

The decomposition into nearly decomposable matrices has no analytical solution. However, algorithms can provide heuristic decompositions when there is no single unique correct answer.

Newman (2006a, 2006b) proposed using modularity for the decomposition of nearly decomposable matrices since modularity can be maximized as an objective function.

Blondel et al. (2008) used this function for relocating units iteratively in neighbouring clusters. Each decomposition can then be considered in terms of whether it increases the modularity.
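For reference (this formula is standard and not quoted from the paper), the modularity that the unfolding algorithm maximizes is Newman's objective for an undirected, weighted network:

\[
Q = \frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\delta(c_i, c_j)
\]

where A_{ij} is the weight of the citation link between journals i and j, k_i = \sum_j A_{ij}, 2m = \sum_{ij} A_{ij}, and \delta(c_i, c_j) equals 1 when both journals are assigned to the same community and 0 otherwise.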

Analogously, Rosvall and Bergstrom (2008) maximized the probabilistic entropy between clusters by estimating the fraction of time during which every node is visited in a random walk (cf. Theil, 1972; Leydesdorff, 1991).

The data were harvested from the CD-Rom version of the JCR of the SCI and Social Science Citation Index 2006, and then combined. ... The resulting set of 7,611 journals and their citation relations otherwise precisely corresponds to the online version of the JCRs. This large data matrix of 7,611 times 7,611 citing and cited journals was stored conveniently as a Pajek (.net) file and used for further processing.
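As a rough illustration of this kind of processing (a sketch only: the file name, the use of networkx instead of Pajek, and the symmetrization step are my assumptions, not the authors' setup), a matrix stored as a Pajek file can be loaded and decomposed as follows:

# Sketch: load an aggregated journal-journal citation network stored as a
# Pajek .net file and decompose it with a Louvain-style unfolding algorithm.
import networkx as nx

# read_pajek returns a multigraph; collapse parallel edges into a simple
# undirected weighted graph before community detection.
multi = nx.read_pajek("jcr2006_journals.net")          # hypothetical file name
G = nx.Graph()
for u, v, data in multi.edges(data=True):
    w = float(data.get("weight", 1.0))
    G.add_edge(u, v, weight=G.get_edge_data(u, v, {}).get("weight", 0.0) + w)

# Louvain-style modularity maximization (available in networkx >= 3.0).
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
print(len(communities), "communities; largest:",
      max(len(c) for c in communities), "journals")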

The 7,611 journals are attributed by the ISI with 11,856 subject classifiers. This is 1.56 (±0.76) classifiers per journal. The ISI staff assign the 220 ISI Subject Categories on the basis of a number of criteria including the journal's title and its citation patterns (McVeigh, personal communication, March 9, 2006; Bensman & Leydesdorff, 2009).

According to the evaluation of Pudovkin and Garfield (2002), in many fields these categories are sufficient, but the authors added that “in many areas of research these ‘classifications’ are crude and do not permit the user to quickly learn which journals are most closely related” (p. 1113).

Leydesdorff and Rafols (2009) found that the ISI Subject Categories can be used for statistical purposes—the factor analysis for example can remove the noise—but not for the detailed evaluation. In the case of interdisciplinary fields, problems of imprecise or potentially erroneous classifications can be expected.

For the purpose of developing a new classification scheme of scientific journals contained in the SCIs, Glänzel and Schubert (2003) used three successive steps for their attribution. The authors iteratively distinguished sets cognitively on the basis of expert judgements, pragmatically to retain multiple assignments within reasonable limits, and scientometrically using unambiguous core journals for the classification. The scheme of 15 fields and 68 subfields is used extensively for research evaluations by the Steunpunt Onderwijs and Onderzoek Indicatoren (SOOI), a research unit at the Catholic University in Leuven, Belgium, headed by Glänzel.

The SOOI categories cover 8,985 journals. Using the full titles of the journals, 7,485 could be matched with the 7,611 journals under study in the JCR data for 2006 (which is 98.3%). These journals are attributed 10,840 classifiers at the subfield level. This is 1.45 (±0.66) categories per journal. One category (“Philosophy and Religion”) is missing because the Arts & Humanities Citation Index is not included in our data. Thus, we pursued the analysis with the 67 SOOI categories.

Using Rosvall and Bergstrom's (2008) algorithm with 2006 data, we obtained findings similar to those of these authors on August 11, 2008. Like the original authors using 6,128 journals in 2004, we found 88 clusters using 7,611 journals in 2006.

Lambiotte, one of the coauthors of Blondel et al. (2008), was so kind as to input the data into the unfolding algorithm and found the following results: 114 communities with a modularity value of 0.527708 and 14 communities with a modularity value of 0.60345. We use the 114 communities for the purposes of this comparison. These categories refer to 7,607 (= 7611 − 4) journals because four of the journals in the file were isolates.

The number of journals per category is log-normally distributed in each of the four classifications. In other words, they all have a relatively small number of categories with a large number of journals and many categories with only a few journals. However, as shown in Figure 1, the classifications based on the random walk and unfolding algorithms are more skewed than the content-based classifications.



Whereas the top-10 categories on the basis of a random walk comprise 57% of the journals (50% for unfolding), they cover only 15% in the ISI decomposition and 31% for the SOOI classification. In the case of skewed distributions, the characteristic number of journals per category can best be expressed by the median: the median is below 30 in the random walk or unfolding classifications, compared with 42 journals for the ISI classification and 141 for the SOOI classification (Table 1).
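These summary statistics are easy to reproduce from a mapping of categories to assigned journals; a minimal sketch (the data structure and names are mine, not the paper's):

# Sketch: median category size and the share of journals covered by the ten
# largest categories, given `classification`: {category: [journal, ...]}.
# Journals with multiple assignments simply appear in several lists.
from statistics import median

def category_stats(classification):
    sizes = [len(journals) for journals in classification.values()]
    all_journals = {j for journals in classification.values() for j in journals}
    top10 = sorted(classification, key=lambda c: len(classification[c]), reverse=True)[:10]
    top10_journals = {j for c in top10 for j in classification[c]}
    return {
        "categories": len(sizes),
        "median_size": median(sizes),
        "top10_share": len(top10_journals) / len(all_journals),
    }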


As presented in the last rows of Table 1, the total numbers of citations in the aggregated matrices based on the ISI or SOOI classifications are much higher because the same citation can be attributed to two or three categories. Thus, whereas random walk and unfolding lead to matrices with most citations within categories (on the diagonal), matrices based on ISI and SOOI classifications lead to matrices with most citations between categories (off-diagonal).

Finally, to measure how similar the categories in the four decompositions are to each other, we computed the cosine similarities in the citation patterns between each pair of citing categories in the four aggregated category-category matrices (Salton & McGill, 1983; Ahlgren, Jarneving, & Rousseau, 2003).
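A minimal sketch of this computation, assuming the aggregated category-by-category citation matrix is available as a NumPy array with citing categories in the rows (variable names are mine):

# Sketch: median cosine similarity between the citing categories of an
# aggregated category-category citation matrix C (rows = citing categories).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def median_between_category_cosine(C):
    sims = cosine_similarity(C)                    # pairwise cosines between rows
    pairs = sims[np.triu_indices_from(sims, k=1)]  # each pair once, diagonal excluded
    return float(np.median(pairs))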

We find again that all the distributions are highly skewed and that the random walk and unfolding algorithms exhibit a much lower median similarity value among categories. The lower medians indicate that the algorithmic decompositions produce a much “cleaner” cut between categories than the content-based classifications.
In conclusion, the analysis of the statistical properties of the different classifications teaches us that the random walk and the unfolding algorithms produce much more skewed distributions in terms of the number of journals per category, but these constructs are more specific than the content-based classification of the ISI and SOOI. The content-based sets are less divided because the boundaries among them are blurred by the multiple assignments.

In summary, although the correspondences among the main categories are sometimes as low as 50% of the journals, most of the mismatched journals appear to fall in areas within the close vicinity of the main categories. In other words, it seems that the various decompositions are roughly consistent but imprecise.

Maps of science for each decomposition were generated from the aggregated category-category citation matrices using the cosine as similarity measure.

The similarity matrices were visualized with Pajek (Batagelj & Mrvar, 1998) using Kamada and Kawai's (1989) algorithm.

The threshold value of similarity for edge visualization is pragmatically set at cosine > 0.01 for the algorithmic decompositions and cosine > 0.2 for the content-based decompositions to enhance the readability of the maps without affecting the representation of the structures in the data.
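A hedged sketch of the same workflow with networkx and matplotlib instead of Pajek (only the threshold values are taken from the text; everything else is an assumption):

# Sketch: build a category map from a cosine-similarity matrix, keep only edges
# above a threshold, and lay the map out with the Kamada-Kawai algorithm.
import networkx as nx
import matplotlib.pyplot as plt

def draw_map(sims, labels, threshold):
    """sims: square cosine matrix; labels: category names; threshold: e.g. 0.01 or 0.2."""
    G = nx.Graph()
    G.add_nodes_from(labels)
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if sims[i][j] > threshold:
                G.add_edge(labels[i], labels[j], weight=sims[i][j])
    pos = nx.kamada_kawai_layout(G, weight="weight")
    nx.draw_networkx(G, pos, node_size=50, font_size=6)
    plt.show()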

For the ISI decomposition, the 220 categories (Figure 3) were clustered into 18 macro-categories (Figure 4) obtained from the factor analysis (cf. Leydesdorff and Rafols, 2009).


The map of the SOOI classification was constructed with all its 67 subfields (Figure 5).


Taking advantage of the concentration of journals in a few categories, in the case of random walk and unfolding only the top 30 and 35 categories were used, respectively.


Indeed, the four maps correspond in displaying two main poles: a very large pole in the biomedical sciences and a second pole in the physical sciences and engineering. These two poles are connected via three bridging areas: chemistry, a geosciences-environment-ecology group, and the computer sciences. The social sciences are somewhat detached, linked via the behavioral sciences/neuroscience to the biomedical pole, and via the computer sciences and mathematics to the physics/engineering pole.

As noted above, although categories of different decompositions do not always match with one another, most “misplaced” journals are assigned into closely neighbouring categories. Therefore, the error in terms of categories is not large and is also unsystematic. The noise-to-signal ratio becomes much smaller when aggregated over the relations among categories.

As a second important observation that can be made on the basis of these maps, we wish to point to the differences in category density between the content-based and the algorithm-based maps.

In summary, we were surprised to find that the different science maps are similar except that they differ in the density of categories within groups.

The content-based classifications achieve a more balanced coverage of the disciplines at the expense of distinguishing categories that may be highly similar in terms of journals.

The first finding is that the algorithmic decompositions have very skewed and clean-cut distributions, with large clusters in a few scientific areas, whereas indexers maintain more even and overlapping distributions in the content-based classifications.

Second, the different classifications show a limited degree of agreement in terms of matching categories. In spite of this lack of agreement, however, the science maps obtained are surprisingly similar; this robustness is due to the fact that although categories do not match precisely, their relative positions in the network among the other categories are based on distributions that match sufficiently to produce corresponding maps at the aggregated level.

Monday, April 6, 2015

Chen, C.-M. (2008). Classification of scientific networks using aggregated journal-journal citation relations in the Journal Citation Reports. Journal of the American Society for Information Science and Technology, 59(14), 2296–2304. doi: 10.1002/asi.20935


This study uses the affinity propagation method (Frey & Dueck, 2007) with aggregated journal-journal citation relations to classify the scientific networks formed by journals with similar citation patterns. Many earlier studies have analyzed journal-journal citation data: Pudovkin and Garfield (2002) developed a relatedness factor from citation data to identify semantically related journals; Doreian and Fararo (1985) identified structurally equivalent journals in the network; and Leydesdorff and Cozzens (1993) used principal component analysis to obtain eigenvectors of the scientific network. The citation data used in this study comprise the 2001 SCI (1,905 journals, 426,065 articles, and 13,798,138 citations) and the 2005 SSCI (1,578 journals, 66,051 articles, and 2,437,389 citations). The affinity propagation method uses s(i, j) = −dij to measure how well journal j is suited to serve as the representative journal of the category to which journal i belongs, where dij is computed as

and csij is the similarity between the journals' citation patterns:


The affinity propagation method iteratively computes two kinds of values between journals to estimate representativeness: r(i, j) reflects how well suited journal j is to serve as the representative of journal i,

while a(i, j) reflects how appropriate it would be for journal i to choose journal j as its representative.


For journal i, the journal j that maximizes a(i, j) + r(i, j) is identified as its representative.

Given the classification results, the specificity of a category can be expressed by the average distance from all member journals to the category's representative journal: the smaller the average distance, the higher the specificity. The relatedness of category members is expressed by the average distance among all journals in the category: the smaller this value, the more closely related the members are. The classification of the SSCI journals in this study yields 23 categories. Each category roughly corresponds to an SSCI subject category, but the average distance among all members within a category is smaller than in the corresponding SSCI category.

Traditional classification methods (Glänzel & Schubert, 2003) are based on subjective analysis, whose output could vary from one person to another. In other words, these methods are more artistic than scientific.

On the other hand, a quantitative approach to classification is usually constructed based on a set of simple rules, which offers robust classification schemes that do not rely on human interference.

The aggregated journal-journal (J-J) citation data in JCR contain extensive information about interjournal citations, which could provide an understanding of the interaction among various scientific disciplines.

Based on JCR citation data, Pudovkin and Garfield (2002) have used an intuitive criterion (relatedness factor) for finding semantically related journals.

To avoid subjective analysis, various quantitative methods have been proposed to construct a robust classification system of scientific journals using JCR citation information.

A variety of techniques for analyzing J-J citation relationships have been reported in the literature to cluster scientific journals (Doreian & Fararo, 1985; Leydesdorff, 1986; Tijssen, De Leeuw, & Van Raan, 1987).

For example, by applying the notion of structure equivalence to analyze a small set of journals, Doreian and Fararo (1985) have delineated a set of blocks, which contain journals. These blocks have a very close correspondence to a categorization of the journals based on their aims and objectives.

More recently Leydesdorff and Cozzens (1993) have developed an optimization procedure that stabilizes approximated eigenvectors of the scientific network from principal component analysis as representations of clusters. This principal component analysis has been further extended to rotated component analysis (Leydesdorff, 2006; Leydesdorff & Cozzens, 1993), which enables one to focus on specific subsets with internal coherence.

An alternative method of cocitation clustering has been investigated in constructing a World Atlas of Sciences for ISI (Garfield, Malin, & Small, 1975; Leydesdorff, 1987; Small, 1999).

In this article, I propose a quantitative approach to classify the scientific network in terms of aggregated J-J citation relations of JCR using the affinity propagation method (Frey & Dueck, 2007).

The method used by ISI in establishing journal categories for JCR is a heuristic approach, in which the journal categories have been manually developed initially. The assignment of journals was based upon a visual examination of all relevant citation data.

As the number of journals in a category grew, subdivisions of the category were then established subjectively.

Although this is a useful approach, a more robust, convenient, and automatic classification scheme is desired.

The citation data analyzed include the SCI of 2001 and the SSCI of 2005, which are directly computed from the extraction of the CD version of the ISI database.

There are 2,195 journals of impact factor greater than 1 in the 2001 SCI. After removing 290 journals that did not publish any articles in 2001, there are 1,905 journals left in our data set, which contains 426,065 articles and 13,798,138 citations.

For the 2005 SSCI, there are 1,583 journals in the database, of which 1,578 journals have nonzero contents. The SSCI database contains 66,051 articles and 2,437,389 citations.

In principle, the dissimilarity between two journals can be visualized by the differences in their citation patterns. In other words, the citation pattern of each journal is represented by a normalized citation vector, and these vectors form a rescaled citation matrix. The dissimilarity (or similarity) in citation between two journals is related to the scalar product of their citation vectors.
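Assuming the standard cosine measure that this description implies (the paper's own formula is not reproduced in this summary), the similarity of the citation patterns of journals i and j, with citation vectors c_i and c_j, would be

\[
cs_{ij} = \frac{\sum_k c_{ik}\,c_{jk}}{\sqrt{\sum_k c_{ik}^{2}}\;\sqrt{\sum_k c_{jk}^{2}}}
\]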

For mapping or visualization, coefficients of similarity are converted into distances such that closely related journals are short distances apart and remotely related journals are long distances apart.

The affinity propagation method takes as input a collection of similarities between journals, where the similarity s(i, j) measures how well journal j is suited to be the representative of a journal category for journal i. Since the goal is to minimize squared error, we set s(i, j) = −dij.

There are two types of messages exchanged between journals, including the responsibility r(i, j), which is sent from journal i to candidate representative journal (RJ) j, and the availability a(i, j), which is sent from candidate representative journal j to journal i. Here the responsibility reflects the accumulated evidence for how well-suited journal j is to serve as the representative for journal i, and the availability shows the accumulated evidence for how appropriate it would be for journal i to choose journal j as its representative.

Taking into account other potential representative journals for journal i, the responsibility is computed iteratively as
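The displayed formula is not reproduced in this excerpt; the standard responsibility update of Frey and Dueck (2007), written in the (i, j) notation used here, is

\[
r(i, j) \leftarrow s(i, j) - \max_{j' \neq j}\bigl\{a(i, j') + s(i, j')\bigr\}
\]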

where the initial value of a(i, j) is set to zero in the first iteration. Similarly, taking into account the support from other journals that journal j should be a representative, the availability is updated by gathering evidence from journals as to whether each candidate representative would make a good representative journal:
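Again following the standard formulation of Frey and Dueck (2007), the availability update for i ≠ j is

\[
a(i, j) \leftarrow \min\Bigl\{0,\; r(j, j) + \sum_{i' \notin \{i, j\}} \max\{0,\, r(i', j)\}\Bigr\}
\]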

To reflect accumulated evidence that journal j is a representative based on the positive responsibilities sent to candidate representative j from other journals, the self-availability is updated as
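In the standard formulation, the self-availability is

\[
a(j, j) \leftarrow \sum_{i' \neq j} \max\{0,\, r(i', j)\}
\]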

During the process of affinity propagation, the sum of availability and responsibility can be used to identify the representative journal of emerging journal categories. In other words, for any journal i, the value of j that maximizes a(i, j) + r(i, j) identifies that journal j is its representative.

In our classifications, the level of specificity of a category can be found by looking at its value of DRJ (the average distance of members of a category to its representative journal), and relatedness of category members is implied by the value of DJ-J (the average J-J distance within a category).
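As a rough illustration rather than the author's own implementation, the same kind of clustering and the two diagnostics DRJ and DJ-J can be obtained with scikit-learn's AffinityPropagation on a precomputed similarity matrix S = −D:

# Sketch: affinity-propagation clustering of journals from a precomputed
# distance matrix D (scikit-learn implementation, not the author's code),
# followed by the per-category specificity (DRJ) and relatedness (DJ-J).
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_journals(D):
    S = -D                                                   # s(i, j) = -dij
    ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
    for c, rj in enumerate(ap.cluster_centers_indices_):
        members = np.where(ap.labels_ == c)[0]
        d_rj = D[members, rj].mean()                         # mean distance to the representative
        n = len(members)
        sub = D[np.ix_(members, members)]
        d_jj = sub.sum() / (n * (n - 1)) if n > 1 else 0.0   # mean pairwise distance, self excluded
        print(f"category {c}: {n} journals, DRJ={d_rj:.2f}, DJ-J={d_jj:.2f}")
    return ap.labels_, ap.cluster_centers_indices_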

To demonstrate the applicability of the affinity propagation method in clustering a complete data set of journals, we first apply it to cluster journals in the 2005 SSCI database.

Here the cutoff parameter t is set to 0.0001, implying that the maximal value of DJ-J (DJ-Jmax) is 100. This choice of t is quite reasonable since the probability distribution (PD), or normalized histogram (bin size is 1), of DJ-J in the unclustered SSCI journal database is mostly between 0 and 30, as shown in Figure 1.



With a choice of DJ-Jmax = 100, the distance between unrelated journals is much larger than that between related journals. In other words, for any journal category, unrelated journals will not be located in the vicinity of its members (each journal is considered as a point in a high-dimensional space). Thus only correlated journals will be grouped together by the affinity propagation method.

However, if DJ-Jmax is too close to 30, the positions of unrelated journals are not well separated and the distortion to the journal positions due to the introduction of the cutoff would affect the clustering of journals.

For the predicted SSCI classification, only those J-J distances within the same category are considered in calculating its PD of DJ-J.

In Figure 1, there are two peaks observed from the statistical curves of PD in DJ-J, where the first peak shows the relatedness between journals within the database (or categories), while the second peak at DJ-J = 100 indicates the irrelevance between journals within the database (or categories).

For the predicted SSCI classification, clearly its first peak in the PD of DJ-J is much more prominent and the peak width is much narrower than that of the unclustered SSCI database.

On the other hand, its second peak of irrelevance is much smaller than that of the unclustered database.

The probability distribution of the first peak is found to decrease exponentially with DJ-J, i.e., P = P0 exp[−(DJ-J − d0)/ Δ], where P0 is the peak value, d0 is the peak position, and Δ is the decay width. By fitting the statistical data, we find that d0 = 4 and Δ = 9.08 for the unclustered SSCI curve, while d0 = 2 and Δ = 1.72 for the clustered SSCI curve.
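A minimal sketch of such a fit with SciPy (the binning follows the text's bin size of 1; the fitting range and starting values are my assumptions, since the paper's exact fitting procedure is not given):

# Sketch: fit P = P0 * exp(-(D - d0) / delta) to the first peak of the
# normalized histogram (bin size 1) of within-category J-J distances.
import numpy as np
from scipy.optimize import curve_fit

def exp_decay(d, p0, d0, delta):
    return p0 * np.exp(-(d - d0) / delta)

def fit_first_peak(distances, d_lo, d_hi):
    bins = np.arange(0, distances.max() + 2)
    hist, edges = np.histogram(distances, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    mask = (centers >= d_lo) & (centers <= d_hi)
    params, _ = curve_fit(exp_decay, centers[mask], hist[mask],
                          p0=(hist[mask].max(), d_lo, 5.0))
    return params  # (P0, d0, delta)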

The entire journal set of SSCI is decomposed into 23 journal categories.

The relatedness of journals within a category can be seen as the average value of DJ-J within the category, and the specificity of a category is related to the average distance of category members to its RJ.

For any category, a smaller value of DRJ implies a higher level of specificity, and a smaller value of DJ-J implies that journals within a category are more closely related to each other.

In general most categories in our classification scheme have a corresponding category in the ISI classification scheme, and their value of DJ-J seems to be smaller than that of their counterpart in the ISI classification scheme.

When a larger value of the cutoff parameter is used, the maximal distance of DJ-J becomes smaller. ... Since the high-dimensional J-J distance space is now approximated by a high-dimensional sphere of smaller radius, the resolution in clustering journals is higher in this case. Thus the SCI database is expected to be decomposed into more clusters for t = 10−3, compared to the case of t = 10−4. ... Therefore, from comparing clustering results with different values of the cutoff parameter, the relationship among various disciplines can be revealed.

Our results demonstrate that the affinity propagation method can provide a reasonable classification scheme for either a complete database or an incomplete database. This method does not need the number of categories or their size as an input.

Distance between journals is calculated from the similarity of their annual citation patterns with a cutoff parameter to restrain the maximal distance.

Different values of the cutoff parameter lead to different levels of resolution in the classification of journal network. A more coarse-grained classification is obtained when a smaller value of the cutoff parameter (or a larger maximal J-J distance) is used.

We note that, unlike the ISI classification scheme, which allows overlap in the content of journal categories by subjective decisions, each journal uniquely belongs to a category in our classification scheme.