Wednesday, December 23, 2015

Gan, Q., Zhu, M., Li, M., Liang, T., Cao, Y., & Zhou, B. (2014). Document visualization: an overview of current research. Wiley Interdisciplinary Reviews: Computational Statistics, 6(1), 19-36.

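Document visualization is an information visualization technique that transforms textual information, such as words, sentences, documents, and the relationships among them, into visual form, so that users facing a large number of documents can understand them better and with less mental workload. Documents are usually minimally structured but rich in attributes and metadata, so compared with text visualization, document visualization concentrates on documents together with the attributes and metadata they carry. Possible applications include: (1) word frequency and distribution; (2) semantic content and repetition; (3) the topics that distinguish document clusters; (4) the core content of documents; (5) similarity among documents; (6) connections among documents; (7) how document content changes over time; and (8) information diffusion and other patterns in social media, as well as improving text search.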

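The paper analyzes the document visualization techniques it collected by visualization object and by task. Visualization objects are divided into single documents, document collections, streaming text, and search results. The visualization tasks are described below: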

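Single document visualization aims at quickly understanding and absorbing core content and text features. It focuses on words, phrases, semantic relations, and content, and falls into three types:
1. Vocabulary-based visualization, which presents lexical features such as word frequency, word distribution, and lexical structure:
Important techniques include Tag Clouds [6,7] and Wordle [8,9], which use position, color, and size to present word frequency within a single document. Many recent studies build on and improve this approach, such as parallel tag clouds (PTC) [10], ManiWordle [11], context preserving dynamic word cloud [12], and visualization of internet discussion with extruded word clouds [13]. Other methods of this type that work on different principles include TextArc [14] and DocuBurst [16].
2. Semantic structure visualization, which presents entities and the relationships among them:
Semantic Graphs [19] uses Penn Treebank parse trees to extract the subject-verb-object triplet of each sentence and, after resolving pronominal anaphors, links the entities to produce a semantic graph.
3. Visualization based on document content, which presents the characteristics and relations of the content:
For example, WordTree [23] shows the context of a word as a tree. The root is a word selected by the user, each branch represents a context in which the word appears in the document, and node size represents the frequency of each word. Arc Diagrams [25] connect repeated subsequences with semicircular arcs to show complex patterns of repetition in the content.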

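Document collection visualization reveals the topics of document clusters, the similarities and differences among documents, and how content changes over time. The related techniques can be divided into:
1. Visualization of document themes: The goal is to discover specific topics and to reflect the relationships among different topics. Well-known examples are ThemeScapes [26], INSPIRE's ThemeView, and The Galaxy [27], which show how documents are distributed over themes using terrain maps and scatter plots, respectively; the height of each "mountain" on the terrain map represents the strength of a theme. TopicNets presents the thematic relationships among documents with the nodes and links of a network graph. ThemeRiver [29] and Topic Island [30] focus on how themes within a collection change over time. In ThemeRiver [29], the X-axis represents time, and "rivers" of different colors along the Y-axis represent the themes; the width of a river shows the strength of the theme in the associated documents.
2. Visualization of document core content: The goal is to give an overview of a whole document collection. For example, Document Cards [35] presents large document collections. Each card contains the important terms and images of a document; the terms are extracted from the document text with text mining techniques, and the images are extracted from the document and then packed together.
3. Visualization of version changes: Presents the differences among versions. For example, History Flow [36] is designed to show how document content changes across versions on Wikipedia and which authors made the changes. There are also many visualizations aimed at software source code.
4. Visualization of document relationships: Discovers connections among entities across different documents, where entities include people, places, dates, organizations, and so on. Jigsaw [46] is a visualization technique that provides this function. Among other techniques, ContexTour [47] and PivotPaths [49] are applied to collections of papers, and FacetAtlas [48] links entities such as causes, symptoms, treatments, and diagnoses in Google Health documents.
5. Visualization of document similarity: The goal is to place similar documents close to each other and far from dissimilar ones. The self-organizing map (SOM) [50] has often been used to map high-dimensional data onto a 2D plane so that complex, nonlinear relationships among data items are expressed as distances. Well-known examples are Lin's work [52] and WEBSOM [53].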

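On the application side, with the rise of social media in recent years, research on real-time processing of streaming text has flourished. Research problems include the statistical analysis and presentation of topics and terms, highlighting topic-related events, and visualizing the text messages themselves [54]. Christian Rohrdantz [55] reviewed research on real-time visualization of streaming text data, and Whisper [57] can trace information diffusion in social media. Another application is visual interfaces for search results. An early result is TileBars [58]; Sparkler [59] visually presents the results of multiple queries at once; and RankSpiral [60] focuses on comparing the results retrieved by multiple queries or by different search engines.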

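The access methods each technique provides, its requirements for documents, and its main features can be found in the table below.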



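The main features are explained as follows:
1. Extension: The method can be applied to large document collections.
2. Versatility: The method suits a variety of visualization tasks.
3. Interactivity: The method provides a more intuitive human-computer interface and lets users take part in the research and development process.
4. Techniques
Scalable, high-performance algorithms for text processing; Coordinated and Multiple Views; real-time processing techniques.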

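The paper also proposes two research directions for document visualization that remain to be developed: evaluation methods for document visualization [62] and theoretical foundations [63].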


This overview introduces fundamental concepts of and designs for document visualization, a number of representative methods in the field, and challenges as well as promising directions of future development.

Document visualization is a class of information visualization techniques that transform textual information such as words, sentences, documents, and their relationships into a visual form, enabling users to better understand textual documents and to lessen their mental workload when faced with a substantial quantity of available textual documents. [1]

Compared with text visualization, which aims to visualize information at the text level, document visualization concentrates more on visualizing documents that include attributes and metadata in addition to the core textual contents.

Document visualization has significant advantages in helping people analyze and manage large quantities of textual information in many cases. For example, we can intuitively get access to (1) word frequency or distribution; (2) semantic content and repetition; (3) the topic or topics that define document clusters; (4) the core content of a document; (5) similarity among documents; (6) the connections among documents; (7) how content changes over time; and (8) information diffusion or other interesting patterns in social media, as well as improve text searches.

Generally, a ‘document’ is a textual record or a physical form/representation of ‘information’.

The evolving notion of ‘document’ among Jonathan Priest, Otlet, Briet, Schürmeyer, and the other documentalists increasingly emphasized whatever functioned as a document rather than traditional physical forms of documents. [2]

With the development of digital technology, anything that exists in a digital environment, such as a mail message or a technical report, could be considered a document.

Documents are often minimally structured and may be rich with attributes and metadata, especially when concentrated in a specific application domain.

Ben Shneiderman propounded practical guidelines for creating an effective user interface for an interactive information visualization tool. He summarized them in the form of a mantra that an effective tool should follow:
Overview first, zoom and filter, then details on demand. [4]

The mantra is accompanied by a task taxonomy for information visualizations that specifies seven tasks at a high level of abstraction [4]:
• Overview. Gain an overview of the entire collection.
• Zoom. Zoom in on items of interest.
• Filter. Filter out uninteresting items.
• Details-on-demand. Select an item or group and get details when needed.
• Relate. View relationships among items.
• History. Keep a history of actions to support undo, replay, and progressive refinement.
• Extract. Allow extraction of sub-collections and of the query parameters.

We first divide document visualization methods into three main categories:
(1) single document visualization, which puts more emphasis on individual words and the actual contents of a single document;
(2) document collection visualization, which puts more emphasis on large document collections, themes and concepts across a collection, and how documents relate to one another;
(3) extended document visualization, which often deals with comprehensive tasks, involves attributes beyond the content of documents, and is usually applied in a specific field, such as social media or search.

In single document visualization, the goal is to quickly understand and absorb core content and text features. The visualization focuses on words, phrases, semantic relations, and contents.

1. Vocabulary-Based Visualization
Vocabulary is the basic unit of a document. The visualization assists people in understanding words through visual representation of the document vocabulary features, such as word frequency, word distribution, and lexical structure, thereby providing a general idea of contents and features in a document.

Tag Clouds [6,7] and Wordle [8,9] are representative methods that mainly visualize word frequency. They are widely used in the news media and on personal home pages. They lay out raw tokens, colored and sized by the corresponding word frequency within a single document. The compact visual form of the words lets us see the main topics discussed in the text at a glance.
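The frequency-to-size mapping behind a tag cloud can be sketched in a few lines. This is only an illustration, not how Tag Clouds or Wordle are actually implemented: the function name `tag_cloud_sizes`, the point-size range, and the linear scaling are all assumptions, and layout, color, and stop-word removal are left out.

```python
import re
from collections import Counter

def tag_cloud_sizes(text, min_pt=10, max_pt=48, top_n=5):
    """Map the most frequent words of one document to font sizes (hypothetical helper)."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    hi = counts[0][1]   # highest frequency among the kept words
    lo = counts[-1][1]  # lowest frequency among the kept words
    span = max(hi - lo, 1)  # avoid division by zero when all counts are equal
    # Linear interpolation between the smallest and largest font size.
    return {w: min_pt + (c - lo) * (max_pt - min_pt) / span for w, c in counts}

sizes = tag_cloud_sizes("map map map tree tree graph")
print(sizes)
```

In this toy input, "map" occurs most often and gets the largest size, while "graph" gets the smallest. A real tag cloud would usually drop stop words first and might scale sizes logarithmically.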

Recently, some other methods have been proposed that extend the tag/word cloud, such as parallel tag clouds (PTC) [10], ManiWordle [11], context preserving dynamic word cloud [12], and visualization of internet discussion with extruded word clouds [13].

Other examples: TextArc [14], DocuBurst [16].

2. Visualization Based on Semantic Structure

Visualization based on semantic structure usually uses entities and their relationships to reveal the semantic content.

Semantic Graphs [19] is a visualization based on the semantic representation of a document in the form of a semantic graph. First, it extracts a subject–verb–object triplet for each sentence from the Penn Treebank parse tree. Then, it links the triplets to their corresponding entities, which requires resolving pronominal anaphors and attaching the associated WordNet synsets. Thus, the document is summarized by the semantic graph and the list of extracted triplets.
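The linking step can be illustrated with a small sketch. It assumes the subject–verb–object triplets have already been extracted and that anaphora resolution is available as a simple alias table; both are stand-ins for the parsing and resolution stages described above, and `build_semantic_graph` is a hypothetical name, not part of Semantic Graphs.

```python
from collections import defaultdict

def build_semantic_graph(triplets, aliases=None):
    """Merge (subject, verb, object) triplets into a labeled directed graph.

    `aliases` maps pronouns or alternative mentions to a canonical entity;
    it stands in for real anaphora resolution (an assumption of this sketch).
    """
    aliases = aliases or {}
    graph = defaultdict(list)
    for subj, verb, obj in triplets:
        s = aliases.get(subj, subj)
        o = aliases.get(obj, obj)
        graph[s].append((verb, o))  # edge s --verb--> o
    return dict(graph)

triplets = [
    ("Alice", "wrote", "report"),
    ("she", "sent", "report"),
    ("Bob", "read", "report"),
]
graph = build_semantic_graph(triplets, aliases={"she": "Alice"})
print(graph)
```

Because "she" is resolved to "Alice", both of her actions attach to one node instead of two, which is what makes the graph a summary of the document rather than a list of sentences.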

3. Visualization Based on Document Content
Visualization based on document content serves not only to search for specific words but also to reveal the characteristics and relations of the contents in the document.

The WordTree visualization provides the representation of both word frequency and context. Size is used to represent frequency of the term or phrase. The root of the tree is a user-selected word or phrase, and the branches represent the contexts in which the word or phrase is used in the document. Users can click on a branch, choose a different search term or re-center the tree. [23]
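A rough sketch of the data structure behind such a tree follows. It assumes whitespace tokenization and a fixed context depth, and the name `build_word_tree` is hypothetical; the actual WordTree implementation is not described in the source.

```python
def build_word_tree(text, root, depth=3):
    """Nested dict of the words that follow `root`, with occurrence counts."""
    tokens = text.lower().split()
    tree = {"count": 0, "children": {}}
    for i, tok in enumerate(tokens):
        if tok != root:
            continue
        tree["count"] += 1
        node = tree
        # Walk the following words, creating one branch level per word.
        for nxt in tokens[i + 1 : i + 1 + depth]:
            node = node["children"].setdefault(nxt, {"count": 0, "children": {}})
            node["count"] += 1
    return tree

tree = build_word_tree("the cat sat and the cat ran and the dog sat", "the")
print(tree["children"]["cat"]["count"])  # "cat" follows "the" twice
```

The `count` stored at each node is what a renderer would map to text size, and re-centering the tree amounts to calling the function again with a different root.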

Martin Wattenberg’s Arc Diagrams [25] is a visualization method that focuses on showing complex patterns of repetition. It is suited to the analysis of highly structured data like musical compositions and less well-structured data like a web page. Repeated subsequences are identified and connected by semicircular arcs. Height of the arcs represents the distance between the subsequences; and thickness of the arcs represents the length of the subsequences.
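The mapping from repeated subsequences to arcs can be sketched as follows. This is a simplification under stated assumptions: it only looks for repeats of one fixed length and connects consecutive occurrences, whereas the original method selects matching pairs more carefully. The name `repeated_arcs` and the dictionary fields are made up for the illustration.

```python
def repeated_arcs(seq, length):
    """Connect consecutive occurrences of each repeated subsequence of a fixed length."""
    positions = {}
    for i in range(len(seq) - length + 1):
        positions.setdefault(tuple(seq[i : i + length]), []).append(i)
    arcs = []
    for starts in positions.values():
        for a, b in zip(starts, starts[1:]):
            if b - a >= length:  # skip overlapping occurrences
                arcs.append({
                    "start": a,
                    "end": b,
                    "height": b - a,      # distance between the two occurrences
                    "thickness": length,  # length of the repeated subsequence
                })
    return arcs

arcs = repeated_arcs("abcxxabc", 3)
print(arcs)
```

Here the only repeat is "abc" at positions 0 and 5, so a single arc of height 5 and thickness 3 is produced.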

Document Collection Visualization

Document collection visualization usually intends to reveal the topic or topics that define document clusters, the similarities and differences among documents, and how contents change over time.

1. Visualization of Document Themes

The main goal is to discover one or more specific topics and to reflect the relationships among various topics.

It may be used to find hot disciplines, evolutions, and trends.

Methods such as ThemeScapes [26], INSPIRE’s ThemeView, and The Galaxy [27], all developed by the Pacific Northwest National Laboratory, put less emphasis on the time factor and focus more on the characteristics of the document themes at specific points in time.

ThemeView uses a 3D terrain map display to represent different themes. The height of a mountain represents the theme’s strength, and the distance between two mountains represents the similarity between the two themes. Keywords are used to distinguish each mountain. [27]

The Galaxy visualization uses a similar approach, in which themes are visualized as 2D clouds of document points, like stars in a theme galaxy (Figure 8(b)). [27]

Other representations visualize documents and topics as nodes in a node-link graph. TopicNets is a web-based system for visual and interactive analysis of large sets of documents using statistical topic models. [28] The main view is a document-topic graph that allows nodes to be aggregated. The time dimension is represented as a separate visualization, with documents placed chronologically around a broken circle and connected to related topic nodes placed inside the circle.

Methods such as ThemeRiver [29] and Topic Island [30] put greater emphasis on the time factor, focusing more on visualizing thematic variations over time within a collection of documents.

ThemeRiver is laid out on two axes, with the X-axis representing time and the themes stacked along the Y-axis. The ‘river’ flows from left to right through time, and rivers of different colors represent different themes. The width of a river at a given time indicates the strength of that theme in the associated documents, so a narrowing or widening river shows a theme losing or gaining strength. [29]
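The data behind such a river can be sketched as a table of theme strengths per time step, stacked into bands. The sketch assumes that strength is simply the number of documents mentioning a theme keyword and that bands stack upward from zero; ThemeRiver itself draws smooth currents, and the function names here are hypothetical.

```python
from collections import defaultdict

def theme_strengths(docs, themes):
    """Count, per time step, how many documents mention each theme keyword."""
    table = defaultdict(lambda: dict.fromkeys(themes, 0))
    for time, text in docs:
        words = set(text.lower().split())
        row = table[time]  # creates an all-zero row the first time a time step is seen
        for theme in themes:
            if theme in words:
                row[theme] += 1
    return {t: table[t] for t in sorted(table)}

def stack_bands(strengths, themes):
    """Turn strengths into (lower, upper) bands so the themes stack along the Y-axis."""
    bands = {}
    for t, row in strengths.items():
        base = 0
        bands[t] = {}
        for theme in themes:
            bands[t][theme] = (base, base + row[theme])
            base += row[theme]
    return bands

docs = [(2001, "oil prices rise"), (2001, "oil and war"), (2002, "war ends")]
themes = ["oil", "war"]
strengths = theme_strengths(docs, themes)
bands = stack_bands(strengths, themes)
print(bands)
```

The width of each band (upper minus lower) is the theme strength at that time step; a plotting library would fill the area between consecutive bands with one color per theme.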

2. Visualization of Document Core Content

Visualization of document core content mainly intends to give an overview of a collection of documents without reading them entirely.

Document Cards [35] visualizes large document collections, such as paper collections and news reports, which contain both texts and images to describe facts, methods, or stories. It represents the document’s key content as a mixture of images and important terms, similar to cards in a top trumps game. [35]

The pipeline for creating Document Cards is as follows: first, extract the text from the original document and use a text mining approach to extract the key terms; then extract the images, which includes image processing and image packing; finally, lay out the extracted key terms and images to generate the corresponding document cards.

3. Visualization of Changes over Different Versions

Visualization of changes over different versions is used to visualize differences among multiple document versions that are generated over time.

History Flow [36] is designed to show changes between multiple document versions on Wikipedia. It can visualize the process of content changes and the corresponding authors who made the amendments. It also reveals some complex patterns of cooperation and conflict, such as vandalism and repair, anonymity versus named authorship, negotiation, and content stability.
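The bookkeeping such a view relies on, namely which author introduced each surviving line, can be sketched with Python's standard `difflib`. This is not History Flow's own algorithm; the line-level granularity and the name `attribute_lines` are assumptions of the illustration.

```python
import difflib

def attribute_lines(versions):
    """Track which author introduced each line across successive versions.

    `versions` is a list of (author, lines) pairs in chronological order.
    """
    owners, prev = [], []
    for author, lines in versions:
        matcher = difflib.SequenceMatcher(a=prev, b=lines, autojunk=False)
        new_owners = [author] * len(lines)  # default: the line is new in this version
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":  # unchanged lines keep their original author
                new_owners[j1:j2] = owners[i1:i2]
        owners, prev = new_owners, lines
    return list(zip(prev, owners))

versions = [
    ("ann", ["intro", "body"]),
    ("bob", ["intro", "new body", "body"]),
]
print(attribute_lines(versions))
```

Coloring each line by its owner and drawing one column per version gives the basic shape of a history-flow style display; deleted lines simply drop out of the next column.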

Software visualization [37–39] focuses on visualizing software development. SeeSoft, [40] Augur, [41] and Advizor [42] are visualizations for code documents. Xia gives visual insight into version control activities, such as architectural and coding differences between two software versions. [43] Beagle visualizes changes among different released versions. [44] Spectrograph shows when and where changes happen in the system. [45]

4. Visualization of Document Relationships

As the quantity of documents gradually increases, the concepts and entities within them become more and more numerous, making the analyst’s task of evaluation and sense-making more difficult. Thus, it is quite meaningful to visualize connections among documents. The visualization focuses on the correlation among documents, such as the connections among entities across different documents.

Jigsaw [46] is an interactive visualization for document exploration and sense-making, and it supports the analysis of relationships among documents. It visually shows connections between entities in the documents, where entities could be people, places, dates, organizations, and so on. It is suitable for documents describing a set of observations or facts, such as news stories and case reports. It provides multiple views, and each view provides a different perspective.
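One simple way to derive entity connections is to count how often two entities appear in the same document. The sketch below assumes entity extraction has already been done and that co-occurrence within a document counts as a connection; the function name and the sample data are invented for illustration and do not describe Jigsaw's internals.

```python
from collections import Counter
from itertools import combinations

def entity_connections(doc_entities):
    """Count in how many documents each pair of entities appears together."""
    pairs = Counter()
    for entities in doc_entities.values():
        # Sort so that each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs

doc_entities = {
    "report1": ["Smith", "Boston", "2009"],
    "report2": ["Smith", "Boston"],
    "report3": ["Jones", "Boston"],
}
connections = entity_connections(doc_entities)
print(connections.most_common(1))
```

The pair counts can serve as edge weights in a node-link view, so that strongly connected entities stand out.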

There are other methods for visualizing relations among multiple facets. ContexTour [47] presents the relations among conferences, authors, and topics in paper collections. FacetAtlas [48] shows relations among causes, symptoms, treatments, and diagnoses in Google Health documents. PivotPaths [49] visually explores relations of authors, keywords, and citations in academic publications.

5. Visualization of Document Similarity

In many cases of document collection visualizations, the goal is to place similar documents close to each other and dissimilar ones far apart.

The self-organizing map (SOM) [50] is a nonlinear projection method. It converts complex, nonlinear relationships between high-dimensional data items into simple geometric relationships on a 2D display.
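A single SOM training step can be sketched in pure Python. To keep it short, the sketch uses a 1D grid of nodes rather than the 2D display mentioned above and a hard "bubble" neighborhood instead of a smoothly decaying one; `som_step` and its parameters are names chosen for this illustration.

```python
def som_step(weights, x, lr=0.5, radius=1):
    """One training step of a 1D self-organizing map.

    Finds the best matching unit (BMU) for input `x`, then moves the BMU and
    its grid neighbors toward `x`. Updates `weights` in place and returns the
    index of the BMU.
    """
    dist = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    bmu = dist.index(min(dist))
    for j, w in enumerate(weights):
        if abs(j - bmu) <= radius:  # "bubble" neighborhood on the 1D grid
            weights[j] = [wi + lr * (xi - wi) for wi, xi in zip(w, x)]
    return bmu

weights = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
bmu = som_step(weights, [0.0, 0.2], lr=0.5, radius=1)
print(bmu, weights)
```

Repeating this step over many document vectors, while shrinking `lr` and `radius`, is what makes neighboring nodes end up representing similar documents.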

When applied to information retrieval, the SOM is usually presented as a map display. [51] Differently colored areas represent different concepts in the documents. The size of an area indicates its relative importance in the collection. Neighboring regions show commonalities in concepts. Dots in regions can represent documents. Additional information can be found in Xia Lin’s map display [52] and WEBSOM. [53]

With the rise of social media (a textual medium), text streams, such as Twitter posts, are being generated in volumes that grow every day. A large body of research has appeared in recent years. Those works have different focuses and often pursue multiple goals, such as dealing with the statistical analysis and presentation of topics or terms, focusing on the emergence of topic events, [33] and visualizing the text messages themselves. [54]

Christian Rohrdantz [55] provides an overview of real-time visualization of streaming text data.

STREAMIT [56] presents a similar visual representation of text streams which applies to news documents.

Whisper [57] traces information diffusion processes in social media in real time.

Search visualization visualizes the results of search operations. A relatively early approach is TileBars [58], which aims to minimize the time and effort needed to decide which documents to view in detail.
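TileBars represents each retrieved document as a bar divided into segments, with one row per query term and darker tiles where the term occurs more often. The sketch below computes that grid of counts under a simplifying assumption: segments are fixed-size token windows, whereas the original system splits the text into topical passages. The name `tile_bars` is hypothetical.

```python
def tile_bars(text, query_terms, segment_size=4):
    """Per-term counts for each fixed-size segment of a document."""
    tokens = text.lower().split()
    segments = [tokens[i : i + segment_size] for i in range(0, len(tokens), segment_size)]
    # One row per query term, one column per segment.
    return {term: [seg.count(term) for seg in segments] for term in query_terms}

tiles = tile_bars("cat dog cat bird fish cat dog dog", ["cat", "dog"])
print(tiles)
```

Reading the rows side by side shows where in a document the query terms overlap, which is the cue a user needs to decide whether the document is worth opening.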

Susan Havre [59] introduces a graphical method called Sparkler for visually presenting and exploring the results of multiple queries simultaneously.

RankSpiral addresses the problem of how to enable users to visually explore and compare large sets of documents that have been retrieved by different search engines or queries. [60]

We have mainly considered the visualization objects and tasks when classifying document visualization methods. We consider this classification more suitable than others (e.g., by representation: pixel-based, map-based, tree-based graphs, node-link diagrams, circle graphs, etc. [1]), since visualization is usually task dependent, and users commonly begin with data and tasks. In practice, a method may belong to more than one category even under the same classification criteria, so we classify each method according to its key visualization focus (the visualization objects and tasks).

In Table 1, we summarize and compare those methods mainly from four aspects to give readers a brief view.

• Characteristics visualized. The characteristics of a document visualized by the method, such as word frequency, semantic relations, content, changes, or connections among documents.

• Principles satisfied. The design principles satisfied, as noted in the Document and Document Design section, that is, the seven tasks: 1) Overview; 2) Zoom; 3) Filter; 4) Details-on-demand; 5) Relate; 6) History; 7) Extract.

• Requirements for a document. Document types suitable for the visualization method, i.e., whether the visualization method has special requirements for a document, such as document content or structure.

• Main features. The visualization method’s features, especially its versatility and interactivity.

Despite these differences, document visualization methods share the same pipeline: get the data (a document or documents), transform it into vectors, then run algorithms based on the tasks of interest (e.g., similarity, search, clustering) and generate the visualizations.
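The first two stages of this pipeline can be sketched for the similarity task. The sketch assumes TF-IDF weighting with a smoothed inverse document frequency and cosine similarity, which are common choices but are not prescribed by the source; the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Transform raw documents into sparse TF-IDF vectors (dicts of term weights)."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF keeps every weight positive, even for very common terms.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = ["word cloud layout", "word cloud colors", "river of themes"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The resulting similarity matrix is the input to the final stage, for example a projection that places similar documents close together on a map.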


Document visualization techniques combine human wisdom and computer graphics, allowing users to efficiently and intuitively browse, explore, and understand the increasing quantity of documents.

1. Extension: Existing methods can be extended to suit large-scale document collections.
2. Versatility: It is important to design relatively general visualization models for different tasks within this field, since existing methods tend to have a narrow scope of application because each targets a specific task.
3. Interactivity: It is important to design more intuitive human–machine interfaces to improve the user’s experience of interaction. It is also crucial to find ways for users to participate in the research and development process, especially during the testing period.
4. Techniques:
• Algorithms. Develop and adopt scalable, high-performance algorithms for text processing, such as text summarization and clustering.
• Parallel processing technology. With the adoption and popularity of Coordinated and Multiple Views (CMV), a visualization system usually includes multiple views.
• Real-time processing technology.

1. Evaluation: Many document visualization methods, and information visualization methods in general, lack quantitative measures of overall quality, novelty, uncertainty, and other evaluative criteria. More and more recent publications reflect upon current practices in visualization evaluation. In fact, the BELIV workshop was created as a venue for researchers and practitioners to ‘explore novel evaluation methods, and to structure the knowledge on evaluation in information visualization around a schema’. [62]

2. Theoretical Foundations: The 2007 Dagstuhl Workshop identified collaborative information visualization, together with theory building, as major directions for future development. [63]
