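In recent years, researchers' interest in text visualization and visual text analytics has grown, owing in part to the availability of large and varied text data and the adoption of text processing algorithms. This study presents an interactive visual survey of text visualization techniques. Using the data collected for the survey, it analyzes the current state of text visualization, compares the analysis and visualization techniques used across studies, and analyzes information about the researchers involved. The aims are to support the search for related work, the exploration of the subfield, and insight into research trends.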
Drawing on previous work, the study builds a taxonomy of text visualization techniques along the following facets: analytic tasks, visualization tasks, data domain, data source, data properties, visualization dimensionality, visualization representation, and visualization alignment.
Analytic tasks are the main goals that users expect to achieve by employing a text visualization technique. The categories are:
1. Text Summarization / Topic Analysis / Entity Extraction
2. Discourse Analysis: the linguistic analysis of the flow of a text or conversation transcript.
3. Sentiment Analysis
4. Event Analysis
5. Trend Analysis / Pattern Analysis
6. Lexical / Syntactical Analysis
7. Relation / Connection Analysis
8. Translation / Text Alignment Analysis
Visualization tasks are the lower-level representation and interaction tasks supported by text visualization techniques. They include:
1. Region of Interest (automatic highlighting or suggestion of regions of interest)
2. Clustering / Classification / Categorization
3. Comparison
4. Overview
5. Monitoring
6. Navigation / Exploration
7. Uncertainty Tackling
Data domains include:
1. Online Social Media
2. Communication
3. Patents
4. Reviews / Medical Records
5. Literature / Poems
6. Scientific Articles / Papers
7. Editorial Media
Data sources are Document [33], Corpora [25], and Streams [19]. Special data properties are Geospatial [11], Timeseries [14], and Networks [6]. Visual representations include Line Plot / River [9, 18], Pixel / Area / Matrix [13, 7, 4], Node-Link [32], Clouds / Galaxies [1, 3], Maps [34], Text [26], and Glyph / Icon [28, 10]. Alignments include Radial [35], Linear / Parallel [8], and Metric-dependent [22].
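The study reports that more than half (56%) of the surveyed text visualization techniques involve tasks related to topic modeling. Most support corpora as a data source (70%), and many support time-dependent data (43%). Visual representations are predominantly two-dimensional (2-D); fewer than 4% of all entries use three-dimensional (3-D) representations.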
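The top five authors by number of entries are Daniel A. Keim (17 entries), Shixia Liu (12 entries), Christian Rohrdantz (9 entries), Daniela Oelke (7 entries), and Huamin Qu (7 entries). Building a researcher collaboration network from co-authorship relations and examining its connected components shows that most components are small isolated groups, while the largest component contains 106 authors. The two major clusters in that component correspond to the research groups at the University of Konstanz and Microsoft Research Asia, centered on Daniel A. Keim and Shixia Liu respectively. These two authors are also the nodes with the highest betweenness centrality in the network. Although they have no direct collaboration in the data collected for the study, both have collaborated with Dongning Luo and Jing Yang, who share the third-highest betweenness centrality.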
In this paper, we present an interactive visual survey of text visualization techniques that can be used for searching for related work, as an introduction to the subfield, and for gaining insight into research trends.
Interest in text visualization and visual text analytics has been increasing for the last ten years. The reasons for this development are manifold, but the availability of large amounts of heterogeneous text data (caused by the popularity of online social media) and the adoption of text processing algorithms (e.g., for topic modeling) by the InfoVis and Visual Analytics communities are two likely explanations.
Analytic Tasks
These categories describe the main analysis goals that users expect to achieve when employing a text visualization technique.
1. Text Summarization / Topic Analysis / Entity Extraction
2. Discourse Analysis: the linguistic analysis of the flow of a text or conversation transcript.
3. Sentiment Analysis: techniques related to the analysis of sentiment, opinion, and affect.
4. Event Analysis: techniques that deal with the extraction of events from the text data or involve visualization of text in some different manner.
5. Trend Analysis / Pattern Analysis: both automated trend analysis and manual investigation directed at discovering patterns in the textual data.
6. Lexical / Syntactical Analysis
7. Relation / Connection Analysis
8. Translation / Text Alignment Analysis
Visualization Tasks
These are the lower-level representation and interaction tasks that are supported by the text visualization techniques.
1. Region of Interest: the automatic highlighting or suggestion of data items or regions that could be of interest to the user for more detailed investigation.
2. Clustering / Classification / Categorization
3. Comparison
4. Overview: both techniques that provide “the big picture” by displaying a significant portion of the data set and techniques that use special aggregated representations to provide an overview while reducing the visual complexity.
5. Monitoring
6. Navigation / Exploration
7. Uncertainty Tackling
Domain
1. Online Social Media
2. Communication
3. Patents
4. Reviews / (Medical) Records
5. Literature / Poems
6. Scientific Articles / Papers
7. Editorial Media
Data sources include the following self-evident items: Document [33], Corpora [25], and Streams [19].
The special data properties include Geospatial [11], Timeseries [14], and Networks [6].
Representation includes the following items: Line Plot / River [9, 18], Pixel / Area / Matrix [13, 7, 4], Node-Link [32], Clouds / Galaxies [1, 3], Maps [34], Text [26], and Glyph / Icon [28, 10].
Alignment, i.e., layout, includes Radial [35], Linear / Parallel [8], and Metric-dependent [22].
As displayed in the table, our proposed taxonomy includes most of the categories except for two. First, we believe that the underlying data representation (e.g., bag-of-words vs. language model [30] or whole text vs. partial text [24]) is more relevant to the underlying computational methods than to observable visualization techniques.
Second, the same holds for data processing methods (e.g., the specification of involved MDS methods [2]), which are partially covered by other categories in our taxonomy; for instance, the analytic task of topic analysis implies the usage of corresponding computational methods.
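The facets listed above can be read as a small tagging scheme: each surveyed technique carries zero or more labels per facet. The following is a minimal sketch of that idea, not the survey's actual data model. The `Entry` class, the `validate` and `matches` helpers, the facet keys, and the sample technique name are all hypothetical; the "2D"/"3D" dimensionality labels are assumed from the 2-D/3-D distinction mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# Facets and labels copied from the lists above; the dictionary keys are made up.
TAXONOMY: Dict[str, Set[str]] = {
    "analytic_task": {
        "Text Summarization / Topic Analysis / Entity Extraction",
        "Discourse Analysis", "Sentiment Analysis", "Event Analysis",
        "Trend Analysis / Pattern Analysis", "Lexical / Syntactical Analysis",
        "Relation / Connection Analysis", "Translation / Text Alignment Analysis",
    },
    "visualization_task": {
        "Region of Interest", "Clustering / Classification / Categorization",
        "Comparison", "Overview", "Monitoring", "Navigation / Exploration",
        "Uncertainty Tackling",
    },
    "domain": {
        "Online Social Media", "Communication", "Patents",
        "Reviews / (Medical) Records", "Literature / Poems",
        "Scientific Articles / Papers", "Editorial Media",
    },
    "data_source": {"Document", "Corpora", "Streams"},
    "data_property": {"Geospatial", "Timeseries", "Networks"},
    "dimensionality": {"2D", "3D"},  # assumed labels
    "representation": {
        "Line Plot / River", "Pixel / Area / Matrix", "Node-Link",
        "Clouds / Galaxies", "Maps", "Text", "Glyph / Icon",
    },
    "alignment": {"Radial", "Linear / Parallel", "Metric-dependent"},
}


@dataclass
class Entry:
    """One surveyed technique with its labels per facet (hypothetical structure)."""
    name: str
    year: int
    tags: Dict[str, Set[str]] = field(default_factory=dict)


def validate(entry: Entry) -> List[Tuple[str, str]]:
    """Return the (facet, label) pairs of the entry that the taxonomy does not define."""
    unknown = []
    for facet, labels in entry.tags.items():
        allowed = TAXONOMY.get(facet, set())
        for label in sorted(labels):
            if label not in allowed:
                unknown.append((facet, label))
    return unknown


def matches(entry: Entry, query: Dict[str, Set[str]]) -> bool:
    """True if the entry carries every requested label in every requested facet."""
    return all(wanted <= entry.tags.get(facet, set()) for facet, wanted in query.items())


# Hypothetical entry, for illustration only.
demo = Entry("DemoRiver", 2012, {
    "analytic_task": {"Sentiment Analysis"},
    "data_source": {"Streams"},
    "representation": {"Line Plot / River"},
})
print(validate(demo), matches(demo, {"data_source": {"Streams"}}))
```

Filtering a collection of such entries with `matches` is one simple way to think about searching for related work by category.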
Using the data collected for the survey, we have been able to analyze the general state of the text visualization field, to compare the usage of various analysis and visualization techniques (with regard to our taxonomy), and to analyze the information about researchers in this field.
According to our current set of entries, the rapid increase in text visualization techniques started around 2007.
With regard to category statistics (cf. Fig. 4), there is an obvious interest in tasks related to topic modeling (56% of all entries).
The majority of the techniques support corpora as data sources (70% of all entries), and many of them support time-dependent data (43% of all entries).
Another result, which is probably expected, is that fewer than 4% of all entries use 3-dimensional visual representations.
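The percentages above are shares of entries that carry a given category label. Here is a minimal sketch of that arithmetic, assuming each entry is simply a set of labels. The `category_shares` function is hypothetical, and the four toy entries are made up for illustration; they are not the survey's data.

```python
from collections import Counter


def category_shares(entries):
    """Percentage of entries carrying each label.

    Labels are not mutually exclusive, so the shares need not sum to 100.
    """
    total = len(entries)
    if total == 0:
        return {}
    counts = Counter()
    for labels in entries:
        counts.update(set(labels))  # count each label once per entry
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy data for illustration only.
toy_entries = [
    {"Corpora", "Timeseries", "2D"},
    {"Corpora", "2D"},
    {"Document", "3D"},
    {"Corpora", "Timeseries", "2D"},
]

shares = category_shares(toy_entries)
for label in sorted(shares, key=shares.get, reverse=True):
    print(f"{label}: {shares[label]:.0f}% of entries")
```

Because one technique can support several data sources or properties at once, figures such as 70% and 43% can overlap rather than partition the entries.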
We have also taken a look at the authorship statistics for the current data set. The top five authors with regard to number of techniques are Daniel A. Keim (17 entries), Shixia Liu (12 entries), Christian Rohrdantz (9 entries), Daniela Oelke (7 entries), and Huamin Qu (7 entries).
As seen in Fig. 5, the majority of author nodes belong to small isolated connected components (fewer than 10 nodes), while the graph also contains one big connected component with 106 nodes.
The two major clusters in that component represent the research groups from the University of Konstanz and Microsoft Research Asia with Daniel A. Keim and Shixia Liu as cluster center nodes.
Shixia Liu and Daniel A. Keim have the 1st and the 2nd largest betweenness values in the graph, respectively. While these two researchers have no direct collaboration with regard to our data set, they have both collaborated with Dongning Luo and Jing Yang, who share the 3rd largest betweenness value.
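The co-authorship analysis described above rests on two standard graph computations: connected components and betweenness centrality. The sketch below shows both on a toy graph, using breadth-first search for components and Brandes' algorithm for unnormalized betweenness. It is an illustration under stated assumptions, not the authors' tooling: the function names and the single-letter author names are hypothetical, and the toy papers are invented.

```python
from collections import defaultdict, deque
from itertools import combinations


def coauthor_graph(papers):
    """Undirected graph: authors are nodes, an edge links two co-authors."""
    graph = defaultdict(set)
    for authors in papers:
        for a in authors:
            graph[a]  # touch the key so single-author papers still yield a node
        for a, b in combinations(set(authors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def connected_components(graph):
    """Connected components found by breadth-first search, largest first."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), {start}
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        components.append(comp)
    return sorted(components, key=len, reverse=True)


def betweenness(graph):
    """Unnormalized betweenness centrality (Brandes, unweighted, undirected)."""
    score = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack, queue = [], deque([s])
        preds = {v: [] for v in graph}
        sigma = dict.fromkeys(graph, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(graph, 0.0)
        while stack:  # accumulate dependencies in order of decreasing distance
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in score.items()}


# Toy papers: hubs A and D never co-author directly, but both publish with X.
papers = [["A", "B"], ["A", "C"], ["A", "G"], ["A", "X"],
          ["D", "X"], ["D", "E"], ["D", "F"], ["D", "H"], ["P", "Q"]]
graph = coauthor_graph(papers)
print([len(c) for c in connected_components(graph)])
ranking = sorted(betweenness(graph).items(), key=lambda kv: -kv[1])
print(ranking[:3])
```

In this toy graph the two hubs have the highest betweenness and the author who links them comes next, which is the same kind of pattern the text reports for the two cluster centers and their shared collaborators.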