Wednesday, December 23, 2015

Gan, Q., Zhu, M., Li, M., Liang, T., Cao, Y., & Zhou, B. (2014). Document visualization: an overview of current research. Wiley Interdisciplinary Reviews: Computational Statistics, 6(1), 19-36.


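Document visualization is an information visualization technique that transforms textual information such as words, sentences, documents, and the relationships among them into a visual form, so that users facing a large number of documents can understand them better and with less mental workload. Documents are usually minimally structured but rich in attributes and metadata, so, compared with text visualization, document visualization concentrates on documents together with the attributes and metadata they carry. Possible applications include: (1) word frequency and distribution; (2) semantic content and repetition; (3) the topics that define document clusters; (4) the core content of documents; (5) similarity among documents; (6) connections among documents; (7) how document content changes over time; and (8) information diffusion and other patterns in social media, as well as improving text search.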

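The paper analyzes the collected document visualization techniques by visualization object and task. The visualization objects are single documents, document collections, streaming text, and search results. The visualization tasks for each are described below: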

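Single document visualization aims at quickly understanding and absorbing the core content and text features, focusing on words, phrases, semantic relations, and content. It falls into three types:
1. Vocabulary-based visualization, which presents lexical features such as word frequency, word distribution, and lexical structure:
Important techniques include Tag Clouds [6,7] and Wordle [8,9], which use position, color, and size to show word frequency within a single document. More recent work builds on and improves this approach, including parallel tag clouds (PTC) [10], ManiWordle [11], context preserving dynamic word cloud [12], and visualization of internet discussion with extruded word clouds [13]. Other methods in this category that work on different principles are TextArc [14] and DocuBurst [16].
2. Semantic structure visualization, which presents entities and the relationships among them:
Semantic Graphs [19] uses Penn Treebank parse trees to extract subject-verb-object triplets from each sentence, then resolves pronominal anaphors and links the entities to produce a semantic graph.
3. Visualization based on document content, which presents the characteristics and relations of the content:
For example, WordTree [23] shows the context of words as a tree: the root node is a word selected by the user, each branch represents a context in which the word appears in the document, and node size represents word frequency. Arc Diagrams [25] connects repeated subsequences with semicircular arcs to show complex patterns of repetition in the content.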

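Document collection visualization aims to reveal the topics of document clusters, the similarities and differences among documents, and how content changes over time. The related techniques can be divided into:
1. Visualization of document themes: the goal is to discover specific topics and to reflect the relationships among different topics. Well-known examples are ThemeScapes [26], INSPIRE's ThemeView, and The Galaxy [27], which show how documents are distributed over themes as terrain maps or scatter plots; the height of each "mountain" on a terrain map represents the strength of a theme. TopicNets uses the nodes and links of a network graph to show thematic relationships among documents. ThemeRiver [29] and Topic Island [30] focus on how themes in a document collection change over time. In ThemeRiver [29], the X-axis represents time, "rivers" of different colors along the Y-axis represent the themes, and the width of a river shows the strength of that theme in the associated documents.
2. Visualization of document core content: the goal is to give an overview of an entire document collection. For example, Document Cards [35] presents large document collections; each card contains the important terms and images of a document. The terms are extracted from the document text with text mining techniques, and the images are extracted from the document and then packed together.
3. Visualization of changes across versions: shows the differences among versions. For example, History Flow [36] was designed to show how the content of Wikipedia articles changes across versions, together with the corresponding authors. There are also many visualizations aimed at software source code.
4. Visualization of document relationships: discovers connections among entities across different documents, where entities include people, places, dates, organizations, and so on. Jigsaw [46] is a visualization technique that provides this capability. Among other techniques, ContexTour [47] and PivotPaths [49] are applied to collections of papers, and FacetAtlas [48] links entities such as causes, symptoms, treatments, and diagnoses in Google Health documents.
5. Visualization of document similarity: the goal is to place similar documents close to each other and far from dissimilar ones. The self-organizing map (SOM) [50] has often been used for this; it maps high-dimensional data onto a 2D plane so that complex, nonlinear relationships among data items are expressed as distances. Well-known examples are Lin's work [52] and WEBSOM [53].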

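As for applications of text visualization, the recent popularity of social media has made real-time processing of streaming text an active research area. Research problems include the statistical analysis and presentation of topics and terms, highlighting topic-related events, and visualizing the text messages themselves [54]. Christian Rohrdantz [55] reviews research on real-time visualization of streaming text data, and Whisper [57] traces information diffusion in social media. Another application is visual interfaces for search results. TileBars [58] is an early example; Sparkler [59] presents the results of multiple queries visually at the same time; and RankSpiral [60] focuses on comparing the results retrieved by multiple queries or by different search engines.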

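The access methods each technique provides, its requirements for documents, and its main features are summarized in Table 1 of the paper.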



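The main features are explained as follows:
1. Extension: methods should be applicable to large document collections.
2. Versatility: methods should suit multiple visualization tasks.
3. Interactivity: provide users with a more intuitive human-machine interface and let users take part in the research and development process.
4. Techniques: adopt scalable, high-performance algorithms for text processing; adopt Coordinated and Multiple Views (CMV); use real-time processing technology.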

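The paper also proposes two research directions in which document visualization still needs to develop: evaluation methods for document visualization [62] and theoretical foundations [63].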


This overview introduces fundamental concepts of and designs for document visualization, a number of representative methods in the field, and challenges as well as promising directions of future development.

Document visualization is a class of the information visualization techniques that transforms textual information such as words, sentences, documents, and their relationships into a visual form, enabling users to better understand textual documents and to lessen their mental workload when faced with a substantial quantity of available textual documents. [1]

Compared with text visualization, which aims to visualize information at the text level, document visualization concentrates more on visualizing documents that include attributes and metadata in addition to the core textual content.

Document visualization has significant advantages in helping people analyze and control large quantities of textual information in many cases. For example, we can intuitively get access to (1) word frequency or distribution; (2) semantic content and repetition; (3) the topic or topics that define document clusters; (4) the core content of a document; (5) similarity among documents; (6) the connections among documents; (7) how content changes over time; and (8) information diffusion or other interesting patterns in social media, as well as improve text searches.

Generally, a ‘document’ is a textual record or a physical form or representation of ‘information’.

The evolving notion of ‘document’ among Jonathan Priest, Otlet, Briet, Schürmeyer, and the other documentalists increasingly emphasized whatever functioned as a document rather than traditional physical forms of documents. [2]

With the development of digital technology, anything that exists physically in a digital environment, such as a mail message or a technical report, can be considered a document.

Documents are often minimally structured and may be rich with attributes and metadata, especially when concentrated in a specific application domain.

Good practical guidelines for creating an effective user interface for an interactive information visualization tool were put forward by Ben Shneiderman, who suggested in the form of a mantra that an effective information visualization tool should follow the principle:
Overview first, zoom and filter, then details on demand. [4]

The mantra is accompanied by a task taxonomy for information visualizations that specifies seven tasks at a high level of abstraction [4]:
• Overview. Gain an overview of the entire collection.
• Zoom. Zoom in on items of interest.
• Filter. Filter out uninteresting items.
• Details-on-demand. Select an item or group and get details when needed.
• Relate. View relationships among items.
• History. Keep a history of actions to support undo, replay, and progressive refinement.
• Extract. Allow extraction of sub-collections and of the query parameters.

We first divide document visualization methods into three main categories:
(1) single document visualization, which places more emphasis on individual words and the actual contents of a single document;
(2) document collection visualization, which places more emphasis on large document collections, themes and concepts across the collection, and how documents relate to one another;
(3) extended document visualization, which often deals with comprehensive tasks, involves other attributes beyond the content of documents, and is typically applied in a specific field, such as social media and search.

In single document visualization, the goal is to quickly understand and absorb core content and text features. The visualization focuses on words, phrases, semantic relations, and contents.

1. Vocabulary-Based Visualization
Vocabulary is the basic unit of a document. The visualization assists people in understanding words through visual representation of the document vocabulary features, such as word frequency, word distribution, and lexical structure, thereby providing a general idea of contents and features in a document.

Tag Clouds [6,7] and Wordle [8,9] are representative methods that mainly visualize word frequency. They are widely used in the news media and on personal home pages. They provide layouts of raw tokens, colored and sized by the corresponding word frequency within a single document. The compact visual form of the words lets readers grasp the main topics discussed in the text.
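
To make the frequency-to-size idea concrete, here is a minimal sketch that counts words and maps each count linearly to a font size. The helper name `tag_cloud_sizes`, the naive tokenizer, and the 10 to 40 point range are assumptions made for this illustration; they are not taken from Tag Clouds or Wordle, whose layout algorithms are considerably more involved.

```python
import re
from collections import Counter


def tag_cloud_sizes(text, min_pt=10, max_pt=40, top_n=20):
    """Map the most frequent words in `text` to font sizes (hypothetical helper)."""
    # Naive tokenizer: lowercase alphabetic runs only (an assumption of this sketch).
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    hi = counts[0][1]
    lo = counts[-1][1]
    span = hi - lo or 1  # avoid division by zero when all counts are equal
    # Linear interpolation between min_pt and max_pt by frequency.
    return {w: min_pt + (c - lo) * (max_pt - min_pt) / span for w, c in counts}


sizes = tag_cloud_sizes("the cat and the dog and the bird")
```

In this toy input the most frequent word gets the largest size and words that occur once get the smallest; a real tag cloud would also remove stop words and choose positions and colors.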

Recently, other methods extending the tag/word cloud have been proposed, such as parallel tag clouds (PTC) [10], ManiWordle [11], context preserving dynamic word cloud [12], and visualization of internet discussion with extruded word clouds [13].

Other examples: TextArc [14], DocuBurst [16].

2. Visualization Based on Semantic Structure

Visualization based on semantic structure usually uses entities and their relationships to reveal the semantic content.

Semantic Graphs [19] is a visualization based on the semantic representation of a document in the form of a semantic graph. First, it extracts subject–verb–object triplets from each sentence using the Penn Treebank parse tree. Then it links the triplets to their corresponding entities, which requires resolving pronominal anaphors and attaching the associated WordNet synset. Thus, the document is summarized with the semantic graph and the list of extracted triplets.

3. Visualization Based on Document Content
Visualization based on document content serves not only to search for specific words but also to reveal the characteristics and relations of the content in the document.

The WordTree visualization provides the representation of both word frequency and context. Size is used to represent frequency of the term or phrase. The root of the tree is a user-selected word or phrase, and the branches represent the contexts in which the word or phrase is used in the document. Users can click on a branch, choose a different search term or re-center the tree. [23]
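
The tree structure described here can be sketched as a nested dictionary of the words that follow the chosen root. This is only an illustration under stated assumptions: `word_tree`, `_branch`, and the fixed `depth` are hypothetical names and choices, and only the contexts that follow the root word are collected.

```python
from collections import Counter


def word_tree(tokens, root, depth=2):
    """Build a nested dict of the contexts that follow `root` (hypothetical helper).

    Each node maps a following word to (count, subtree); the count is what
    would drive the size of the node in a WordTree-style display.
    """
    # Collect the `depth` tokens that follow every occurrence of the root word.
    contexts = [
        tokens[i + 1 : i + 1 + depth] for i, t in enumerate(tokens) if t == root
    ]
    return _branch(contexts)


def _branch(contexts):
    # Group contexts by their first word, then recurse on the remainders.
    first_words = Counter(c[0] for c in contexts if c)
    tree = {}
    for word, count in first_words.items():
        rest = [c[1:] for c in contexts if c and c[0] == word]
        tree[word] = (count, _branch(rest))
    return tree


tokens = "i have a dream that one day i have a plan".split()
tree = word_tree(tokens, "have")
```

Here both occurrences of "have" are followed by "a", so the tree has one branch of count 2 that then splits into "dream" and "plan".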

Martin Wattenberg’s Arc Diagrams [25] is a visualization method that focuses on showing complex patterns of repetition. It is suited to the analysis of highly structured data like musical compositions and less well-structured data like a web page. Repeated subsequences are identified and connected by semicircular arcs. Height of the arcs represents the distance between the subsequences; and thickness of the arcs represents the length of the subsequences.
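
The arc encoding can be illustrated with a much simplified sketch. It is not Wattenberg's matching algorithm: as an assumption of this example, only subsequences of one fixed length are considered and each occurrence is linked to the next one. The function name `arcs_for_repeats` is hypothetical.

```python
from collections import defaultdict


def arcs_for_repeats(seq, length):
    """Return (start_a, start_b, height, thickness) for repeated subsequences.

    `seq` is a string (or tuple). Simplification assumed by this sketch:
    only subsequences of one fixed `length` are considered, and each
    occurrence is linked to the next occurrence of the same subsequence.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - length + 1):
        positions[seq[i : i + length]].append(i)

    arcs = []
    for starts in positions.values():
        for a, b in zip(starts, starts[1:]):
            if b - a >= length:  # skip overlapping occurrences
                # Height follows the distance between the two occurrences;
                # thickness follows the length of the repeated subsequence.
                arcs.append((a, b, b - a, length))
    return sorted(arcs)


arcs = arcs_for_repeats("abcXabcYabc", 3)
```

For this input the three occurrences of "abc" yield two arcs, each spanning a distance of 4 with thickness 3.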

Document Collection Visualization

Document collection visualization usually intends to reveal the topic or topics that define document clusters, the similarities and differences among documents, and how contents change over time.

1. Visualization of Document Themes

The main goal is to discover one or more specific topics and to reflect the relationships among various topics.

It may be used to find hot disciplines, evolutions, and trends.

Methods such as ThemeScapes [26], INSPIRE’s ThemeView, and The Galaxy [27], all developed by the Pacific Northwest National Laboratory, place less emphasis on the time factor and focus more on the characteristics of the document themes at specific points in time.

ThemeView uses a 3D terrain map display to represent different themes. The height of a mountain represents the theme’s strength, and the distance between two mountains represents the similarity between the two themes. Keywords are used to distinguish each mountain. [27]

The Galaxy visualization uses a similar approach in which themes are visualized as 2D clouds of document points, like stars in a theme galaxy (Figure 8(b)). [27]

There are other representations that visualize documents and topics as nodes in a node-link graph. TopicNets is a web-based system for visual and interactive analysis of large sets of documents using statistical topic models. [28] The main view is a document-topic graph that allows aggregate nodes. The time dimension is represented as a separate visualization, with documents placed chronologically around a broken circle and connected to related topic nodes, which are placed inside the circle.

Methods such as ThemeRiver [29] and Topic Island [30] place greater emphasis on the time factor, focusing more on visualizing thematic variations over time within a collection of documents.

ThemeRiver is laid out on a pair of axes, with the X-axis representing time and the Y-axis representing different themes. The ‘river’ flows from left to right through time, changing width to portray changes in the theme strength of the corresponding documents. Rivers of different colors represent different themes, and the width of a river (i.e., narrow or wide) indicates the strength (decreasing or increasing) of an individual topic in the associated documents. [29]
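
A rough sketch of how such stacked ‘currents’ could be computed from per-theme strengths is shown below. Centering the stack on the horizontal axis and the name `river_layers` are assumptions of this example, and the smoothing of the curves that a real ThemeRiver display applies is left out.

```python
def river_layers(strengths):
    """Stack theme strengths per time step, centred on the horizontal axis.

    `strengths` maps a theme name to a list of non-negative values, one per
    time step. Returns {theme: [(lower, upper), ...]} so that upper - lower
    is the width of that theme's current at each step.
    """
    themes = list(strengths)
    steps = len(next(iter(strengths.values())))
    layers = {t: [] for t in themes}
    for k in range(steps):
        total = sum(strengths[t][k] for t in themes)
        y = -total / 2  # centre the whole river around y = 0
        for t in themes:
            width = strengths[t][k]
            layers[t].append((y, y + width))
            y += width
    return layers


# Toy data: two hypothetical themes over two time steps.
layers = river_layers({"war": [2, 4], "oil": [2, 0]})
```

Each (lower, upper) pair gives the vertical extent of one theme at one time step; filling the region between consecutive pairs with the theme's color would produce the river.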

2. Visualization of Document Core Content

Visualization of document core content mainly intends to give an overview of a collection of documents without requiring users to read them entirely.

Document Cards [35] visualizes large document collections, such as paper collections and news reports, which contain both texts and images to describe facts, methods, or stories. It represents the document’s key content as a mixture of images and important terms, similar to cards in a top trumps game. [35]

The pipeline for creating Document Cards is as follows: first, extract the text from the original document and use a text mining approach to extract the key terms; then carry out the image extraction phases, including image processing and image packing; finally, lay out the extracted key terms and images to generate the corresponding document cards.

3. Visualization of Changes over Different Versions

Visualization of changes over different versions is used to visualize differences among multiple document versions that are generated over time.

History Flow [36] is designed to show changes between multiple document versions on Wikipedia. It can visualize the process of content changes and the corresponding authors who make the amendments. It also reveals some complex patterns of cooperation and conflict, such as vandalism and repair, anonymity versus named authorship, negotiation, and content stability.
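
The bookkeeping behind such a display, namely which author introduced each surviving piece of text, can be sketched with Python's standard `difflib`. This is an illustrative approximation at line granularity, not History Flow's own method; `attribute_lines` and the sample revision history are hypothetical.

```python
from difflib import SequenceMatcher


def attribute_lines(versions):
    """Track which author introduced each line of the latest version.

    `versions` is a list of (author, lines) in chronological order.
    Returns a list of (line, author) for the final version.
    """
    owned = []  # (line, author) pairs for the previous version
    for author, lines in versions:
        old = [line for line, _ in owned]
        matcher = SequenceMatcher(a=old, b=lines, autojunk=False)
        new_owned = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep their original author.
                new_owned.extend(owned[i1:i2])
            elif tag in ("replace", "insert"):
                # New or rewritten lines are credited to the current editor.
                new_owned.extend((line, author) for line in lines[j1:j2])
            # "delete" contributes nothing to the new version.
        owned = new_owned
    return owned


# Hypothetical revision history with made-up author names.
history = [
    ("alice", ["Intro", "Body"]),
    ("bob", ["Intro", "Body", "Footer"]),
    ("carol", ["Intro", "New body", "Footer"]),
]
result = attribute_lines(history)
```

Recording the ownership list after every version, rather than only the last one, would give the per-version bands that a History Flow style chart connects across time.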

Software visualization [37–39] focuses on visualizing software development. SeeSoft, [40] Augur, [41] and Advizor [42] are visualizations for code documents. Xia gives visual insight into version control activities, like architectural and coding differences between two software versions. [43] Beagle visualizes changes among different released versions. [44] Spectrograph shows the time and location where changes happen in the system. [45]

4. Visualization of Document Relationships

As the quantity of documents gradually increases, the concepts and entities within them become more and more numerous, making the analyst’s task of evaluation and sense-making more difficult. Thus, it is quite meaningful to visualize connections among documents. The visualization focuses on the correlation among documents, like the connections among entities across different documents.

Jigsaw [46] is an interactive visualization for document exploration and sense-making, and it supports the analysis of relationships among documents. It visually shows connections between entities in the documents, where entities could be people, places, dates, organizations, and so on. It is suited to documents describing a set of observations or facts, like news stories and case reports. It provides multiple views, and each view provides a different perspective.

There are other methods for visualizing relations among multiple facets. ContexTour [47] presents the relations among conferences, authors, and topics in paper collections. FacetAtlas [48] shows relations among causes, symptoms, treatments, and diagnoses in Google Health documents. PivotPaths [49] visually explores relations of authors, keywords, and citations in academic publications.

5. Visualization of Document Similarity

In many cases of document collection visualizations, the goal is to place similar documents close to each other and dissimilar ones far apart.

The self-organizing map (SOM) [50] is a nonlinear projection method. It expresses complex, nonlinear relationships between high dimensional data items into simple geometric relationships on a 2D display.

When applied to information retrieval, it usually uses map displays. [51] Differently colored areas represent different concepts in the documents. The size of an area indicates its relative importance in the collection. Neighboring regions show commonalities in concepts. Dots in regions can represent documents. Further details can be found in Xia Lin’s map display [52] and WEBSOM. [53]

With the rise of social media (a textual medium), text streams, such as Twitter posts, are being generated in volumes that grow every day. A large body of research has appeared in recent years. Those works have different focuses and always involve multiple targets, such as dealing with the statistical analysis and presentation of topics or terms, focusing on the emergence of topic events, [33] and visualizing the text messages themselves. [54]

Christian Rohrdantz [55] provides an overview of real-time visualization of streaming text data.

STREAMIT [56] presents a similar visual representation of text streams which applies to news documents.

Whisper [57] fulfills the requirement for tracing information diffusion processes in social media, in a real-time manner.

Search visualization visualizes the results of search operations. A relatively early approach is TileBars [58], which aims to minimize the time and effort needed to decide which documents to view in detail.

Susan Havre [59] introduces a graphical method called Sparkler for visually presenting and exploring the results of multiple queries simultaneously.

RankSpiral addresses the problem of how to enable users to visually explore and compare large sets of documents that have been retrieved by different search engines or queries. [60]

We have mainly considered the visualization objects and tasks when classifying document visualization methods. Our classification is considered more acceptable than other classifications (e.g., by representation: pixel-based, map-based, tree-based graphs, node-link diagrams, circle graphs, etc. [1]), since visualization is usually task dependent, and users commonly begin with data and tasks. In fact, a method may belong to different categories even under the same classification criteria, so we classify each method according to its key visualization focus (the visualization objects and tasks).

In Table 1, we summarize and compare those methods mainly from four aspects to give readers a brief view.

• Characteristics visualized. The characteristics of a document visualized by the method, such as word frequency, semantic relations, content, changes, or connections among documents.

• Principles satisfied. The design principles satisfied, as noted in the Document and Document Design section, i.e., the seven tasks: (1) Overview; (2) Zoom; (3) Filter; (4) Details-on-demand; (5) Relate; (6) History; (7) Extract.

• Requirements for a document. Document types suitable to the visualization method, i.e., whether the visualization method has special requirements for a document, like document content, structure, etc.

• Main features. Discuss the visualization method’s features, especially the versatility and interactivity.

Despite these differences, document visualization methods share the same pipeline: get the data (a document or documents), transform it into vectors, then run algorithms based on the tasks of interest (e.g., similarity, search, clustering), and generate the visualizations.
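
The first steps of this pipeline can be sketched in a few lines. The sketch assumes a plain bag-of-words vector and cosine similarity as the task of interest; both are choices made for illustration, and the sample documents and the names `to_vector` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Bag-of-words term counts (one simple choice of vector representation)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


# Step 1: get the data (toy documents invented for this example).
docs = {
    "d1": "word clouds show word frequency",
    "d2": "tag clouds show term frequency",
    "d3": "rivers show themes over time",
}
# Step 2: transform the documents into vectors.
vectors = {name: to_vector(text) for name, text in docs.items()}
# Step 3: run the task-specific algorithm, here pairwise similarity.
sim_12 = cosine(vectors["d1"], vectors["d2"])
sim_13 = cosine(vectors["d1"], vectors["d3"])
```

The resulting similarities are what a layout step (a map display or a node-link diagram, for instance) would consume in order to place similar documents close together.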


Document visualization techniques combine human wisdom and computer graphics, allowing users to efficiently and intuitively browse, explore, and understand the increasing quantity of documents.

1. Extension: Existing methods can be extended to suit large-scale document collections.
2. Versatility: It is important to design relatively general visualization models for different tasks within this field, since existing methods tend to have a narrow scope of application because each targets a specific purpose.
3. Interactivity: It is important to design a more intuitive man–machine interface to improve the user’s interaction experience. It is also crucial to find ways for users to participate in the research and development process, especially the testing period.
4. Techniques:
• Algorithms. Develop and adopt scalable, high-performance algorithms for text processing, such as text summarization and clustering.
• Parallel processing technology. With the adoption and popularity of Coordinated and Multiple Views (CMV), a visualization system usually includes multiple views.
• Real-time processing technology.

1. Evaluation: Many document visualization methods, or even information visualization methods in general, lack a quantitative measurement that can indicate the overall quality, novelty, uncertainty, and other evaluative metrics. More recently, a growing number of publications reflect upon current practices in visualization evaluation. In fact, the BELIV workshop was created as a venue for researchers and practitioners to ‘explore novel evaluation methods, and to structure the knowledge on evaluation in information visualization around a schema’. [62]

2. Theoretical Foundations: The 2007 Dagstuhl Workshop identified collaborative information visualization with theory building as major directions for future development. [63]

Friday, December 18, 2015

Kucher, K., & Kerren, A. (2015). Text Visualization Techniques: Taxonomy, Visual Survey, and Community Insights. In 8th IEEE Pacific Visualization Symposium (PacificVis' 15), Hangzhou, China (pp. 117-121). IEEE Computer Society.


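In recent years, research interest in text visualization and visual text analytics has grown, for reasons including the availability of large and varied text data and the adoption of text processing algorithms. This paper presents an interactive visual survey of text visualization techniques. Using the data collected for the survey, it analyzes the current state of text visualization, compares the analysis and visualization techniques used, and analyzes information about the researchers, with the aims of supporting searches for related work, exploration of the subfield, and insight into research trends.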

Building on earlier work, the paper establishes a taxonomy of text visualization techniques along the following dimensions: analytic tasks, visualization tasks, data domain, data source, data property, visualization dimensionality, visualization representation, and visualization alignment.


Analytic tasks are the main goals users expect to achieve when employing a text visualization technique. The categories are:
1. Text Summarization / Topic Analysis / Entity Extraction
2. Discourse Analysis: the linguistic analysis of the flow of a text or conversation transcript.
3. Sentiment Analysis
4. Event Analysis
5. Trend Analysis / Pattern Analysis
6. Lexical / Syntactical Analysis
7. Relation / Connection Analysis
8. Translation / Text Alignment Analysis

Visualization tasks are the lower-level representation and interaction tasks supported by text visualization techniques, including:

1. Region of Interest: automatic highlighting or suggestion of regions of interest
2. Clustering / Classification / Categorization
3. Comparison
4. Overview
5. Monitoring
6. Navigation / Exploration
7. Uncertainty Tackling

Data domains include:

1. Online Social Media
2. Communication
3. Patents
4. Reviews / Medical Records
5. Literature / Poems
6. Scientific Articles / Papers
7. Editorial Media

Data sources are Document [33], Corpora [25], and Streams [19]. Special data properties include Geospatial [11], Timeseries [14], and Networks [6]. Visual representations include Line Plot / River [9, 18], Pixel / Area / Matrix [13, 7, 4], Node-Link [32], Clouds / Galaxies [1, 3], Maps [34], Text [26], and Glyph / Icon [28, 10]. Alignments include Radial [35], Linear / Parallel [8], and Metric-dependent [22].

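The paper reports that more than half (56%) of the text visualization techniques address tasks related to topic modeling. Most support corpora as data sources (70%), and many support time-dependent data (43%). Visual representations are predominantly two-dimensional; fewer than 4% of all entries use three-dimensional representations.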

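The top five authors in text visualization are Daniel A. Keim (17 entries), Shixia Liu (12 entries), Christian Rohrdantz (9 entries), Daniela Oelke (7 entries), and Huamin Qu (7 entries). When a collaboration network is built from the authors' co-authorship relations, its connected components turn out to be mostly small isolated groups. The largest component contains 106 authors, and its two major clusters are the research groups at the University of Konstanz and Microsoft Research Asia, centered on Daniel A. Keim and Shixia Liu respectively. These two authors also have the highest betweenness centrality in the network. Although they have no direct collaboration in the collected data, both have collaborated with Dongning Luo and Jing Yang, who share the third-highest betweenness centrality.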


In this paper, we present an interactive visual survey of text visualization techniques that can be used for the purposes of search for related work, introduction to the subfield and gaining insight into research trends.

Interest in text visualization and visual text analytics has been increasing for the last ten years. The reasons for this development are manifold, but the availability of large amounts of heterogeneous text data (caused by the popularity of online social media) and the adoption of text processing algorithms (e.g., for topic modeling) by the InfoVis and Visual Analytics communities are two possible explanations.




Analytic Tasks
These items are critical to the main analysis goals that users expect to achieve when employing a text visualization technique.

1. Text Summarization / Topic Analysis / Entity Extraction

2. Discourse Analysis
The linguistic analysis of the flow of text or conversation transcript.

3. Sentiment Analysis
Techniques related to the analysis of sentiment, opinion, and affection.

4. Event Analysis
Techniques that deal with the extraction of events from the text data or involve visualization of text in some different manner.

5. Trend Analysis / Pattern Analysis
Both automated trend analysis and manual investigation directed at discovering patterns in the textual data.

6. Lexical / Syntactical Analysis

7. Relation / Connection Analysis

8. Translation / Text Alignment Analysis

Visualization Tasks
Lower-level representation and interaction tasks that are supported by the text visualization techniques.

1. Region of Interest
The automatic highlighting or suggestion of data items or regions that could be of interest to the user for more detailed investigation.

2. Clustering / Classification / Categorization

3. Comparison

4. Overview
Both techniques that provide “the big picture” by displaying a significant portion of the data set and techniques that use special aggregated representations to provide an overview while reducing the visual complexity.

5. Monitoring

6. Navigation / Exploration

7. Uncertainty Tackling


Domain

1. Online Social Media

2. Communication

3. Patents

4. Reviews / (Medical) Records

5. Literature / Poems

6. Scientific Articles / Papers

7. Editorial Media

Data sources include the following self-evident items: Document [33], Corpora [25], and Streams [19].

The special data properties include Geospatial [11], Timeseries [14], and Networks [6].

Representation includes the following items: Line Plot / River [9, 18], Pixel / Area / Matrix [13, 7, 4], Node-Link [32], Clouds / Galaxies [1, 3], Maps [34], Text [26], and Glyph / Icon [28, 10].

Alignment, i.e., layout, includes Radial [35], Linear / Parallel [8], and Metric-dependent [22].

As displayed in the table, our proposed taxonomy includes most of the categories except for two: we believe that the underlying data representation (e.g., bag-of-words vs. language model [30] or whole text vs. partial text [24]) is more relevant to the underlying computational methods than to observable visualization techniques.

And the same naturally holds for data processing methods (e.g., the specification of involved MDS methods [2]) that are partially covered by other categories in our taxonomy, for instance, the analytic task of topic analysis implies the usage of corresponding computational methods.

Using the data collected for the survey, we have been able to analyze the general state of the text visualization field, to compare the usage of various analysis and visualization techniques (with regard to our taxonomy), and to analyze the information about researchers in this field.

According to our current set of entries, the trend for rapid increase of text visualization techniques started around 2007.

With regard to category statistics (cf. Fig. 4), there is an obvious interest in tasks related to topic modeling (56% of all entries).

The majority of the techniques support corpora as data sources (70% of all entries), and a lot of them support time-dependent data (43% of all entries).

Another result, which is probably expected, is that fewer than 4% of all entries use 3-dimensional visual representations.

We have also taken a look at the authorship statistics for the current data set. The top five authors with regard to number of techniques are Daniel A. Keim (17 entries), Shixia Liu (12 entries), Christian Rohrdantz (9 entries), Daniela Oelke (7 entries), and Huamin Qu (7 entries).

As seen in Fig. 5, the majority of author nodes belong to isolated connected components of small size (fewer than 10 nodes), while there is one large connected component with 106 nodes in the graph.

The two major clusters in that component represent the research groups from the University of Konstanz and Microsoft Research Asia with Daniel A. Keim and Shixia Liu as cluster center nodes.

Shixia Liu and Daniel A. Keim happen to have the 1st and the 2nd largest betweenness values in the graph, respectively. While these two researchers have no direct collaboration with regard to our data set, they both have collaborated with Dongning Luo and Jing Yang who both share the 3rd largest betweenness value.
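
For readers who want to reproduce this kind of analysis on their own bibliography, the sketch below builds a co-authorship graph, finds its connected components, and computes betweenness with Brandes' algorithm. It is only an illustration: the author labels are placeholders rather than the survey's data, the function names are hypothetical, and the paper does not say which tooling its authors used.

```python
from collections import defaultdict, deque
from itertools import combinations


def coauthor_graph(papers):
    """Undirected co-authorship graph: authors are linked if they share a paper."""
    graph = defaultdict(set)
    for authors in papers:
        for a in authors:
            graph[a]  # make sure single-author papers still create a node
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def components(graph):
    """Connected components, largest first."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = {start}, deque([start])
        while queue:
            for nxt in graph[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    comp.add(nxt)
                    queue.append(nxt)
        result.append(comp)
    return sorted(result, key=len, reverse=True)


def betweenness(graph):
    """Unnormalised betweenness centrality (Brandes' algorithm, unweighted)."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        order, preds = [], defaultdict(list)
        sigma = dict.fromkeys(graph, 0)
        dist = {s: 0}
        sigma[s] = 1
        queue = deque([s])
        while queue:  # breadth-first search counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        for w in reversed(order):  # accumulate dependencies back towards s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: b / 2 for v, b in score.items()}


# Placeholder author lists, one tuple per paper.
papers = [("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")]
g = coauthor_graph(papers)
comps = components(g)
bc = betweenness(g)
```

In this toy graph the authors in the middle of the chain A, B, C, D have the highest betweenness, which mirrors how two researchers with no direct collaboration can still be bridged by shared co-authors.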