Event detection from text streams has become a popular research area in recent years. However, the ill-defined concept of an event and the great variety of textual data pose major challenges for data analysis and visualization, so guidelines for building visual analysis tools that can handle specific event types and diverse text sources are scarce. In this study, events are regarded as unexpected and unique patterns, valuable to users, extracted from text data, and it is suggested that the value of an event be defined by news criteria or news values [GR65].
In research on event detection from text data streams with visual analysis, the data sources range from a limited number of well-written news articles to user-written, rapidly generated, and sometimes unstructured text on social media. The analysis tasks can be divided by purpose into new event detection, event tracking, event summarization, and event associations (Dou et al. [DWRZ12]). In her work on event detection in social media, Becker [Bec11] distinguishes events along three dimensions: 1) planned vs. unplanned; 2) trending vs. non-trending; 3) exogenous vs. endogenous.
This study represents the process of event detection and exploration with the pipeline in Figure 1. The input documents are first preprocessed. The approaches applied after preprocessing fall into two groups: one first applies automatic methods to detect events in the data and then uses this information as the interface for visual analysis; the other visualizes the preprocessed results directly, without automatic event analysis.
The following describes the technical aspects of the analysis: text data sources, text processing methods, and so on.
Text data sources include news, email, blogs, RSS feeds, microblogging messages, forum posts, customer feedback forms, and text annotations on images and video streams. Most of the current research targets microblogging data such as Twitter, and besides the text itself, many studies also exploit microblog metadata such as geolocation and author information.
After text processing steps such as sentence detection, tokenizing, stemming, and lemmatizing, deeper methods are applied: part-of-speech tagging, syntactic parsing, typed-dependency parsing, coreference resolution, named entity recognition, polarity extraction, and word sense disambiguation. From 2000 to 2011, only 17 of 33 event detection papers used text processing methods, but after 2012 this changed: 14 of 18 papers used them, mainly part-of-speech tagging and polarity extraction.
Common techniques among the automatic event detection methods fall into eight types: 1) cluster-based, 2) classification-based, 3) statistics-based, 4) prediction-based, 5) ontology-based, 6) pattern mining, 7) models of the recurring characteristics of data streams, and 8) rule-based.
Cluster-based methods group documents by different properties of their content; an event is generated when the set of clusters changes.
Classification-based methods build classifiers for events; a document classified as event-related indicates an event.
Among the statistical methods, correlation-based approaches detect events from changes in the correlation between terms, or between terms and time, in a document collection; another type discovers events from the occurrence of rare or unique terms.
Prediction-based methods predict the occurrence of subsequent documents from past history.
Ontology-based methods are suited to event detection in a single domain: specific ontologies are generated fully or semi-automatically, and a detected change in the activated concepts indicates a possible event.
Pattern mining uses the A-priori algorithm to extract common sequential patterns from document streams.
Models of the recurring characteristics of data streams can be used to detect events: an event is detected when a stream deviates significantly from its expected characteristics.
Rule-based methods detect events with manually written rules, for example rules based on terms or term frequencies that detect a particular event.
Among the visual representations, time-based visualizations dominate: 21 papers use time-based representations such as rivers, timelines, and circular layouts to highlight how the data evolves over time. Timeline representations place glyphs on one or more timelines to show data items, densities, and volumes, and can represent single or rare events. Rivers are usually used to visualize the results of clustering algorithms, showing the distribution of the clusters and the overall volume, with a focus on high-frequency events. Applications based on microblog data usually exploit the attached georeferences and present the data on maps. Basic visualizations such as line or bar charts typically show the volume and frequency of event-related data over time. Across the visualization applications, text is usually represented by meaningful keywords.
The supported analysis tasks include: 1) providing an overview of a document collection that describes the topics found in it, supporting further analysis; 2) keyword search and filtering by metadata, which reduce the amount of data or extracted events in the analysis or visualization; 3) monitoring the temporal development of events in a data source; 4) visualizing relations between different detected events.
Qualitative evaluation methods include case studies, usability evaluations, use cases, and anecdotal evaluations, with use cases being the most prevalent. The most common quantitative evaluation compares the detected events with ground truth data.
Event detection from text data streams has been a popular research area in the past decade.
However, data analysts and visualization experts often face major challenges stemming from the ill-defined concept of an event and the many kinds of textual data. As a result, we have few guidelines on how to build successful visual analysis tools that can handle specific event types and diverse textual data sources.
Within this paper, events are regarded as unexpected and unique patterns, valuable to users, that are extracted from text data streams.
In particular, data sources evolved from a relatively limited amount of well-written news articles to rapidly generated, user-written, and in some cases unstructured textual data from social media services.
Dou et al. [DWRZ12] defined the tasks “New Event Detection”, “Event Tracking”, “Event Summarization”, and “Event Associations”, but we expect that tasks can become even more diversified, including a geographic dimension that has not been explored yet.
Another challenge is the unstructured, diverse textual data itself, which mandates extensive processing and preparation before it can be properly employed.
Becker [Bec11] presents interesting work on event detection in social media. She divides events along three dimensions: 1) “planned” vs. “unplanned”; 2) “trending” vs. “non-trending”; 3) “exogenous” vs. “endogenous”. The last dimension concerns whether an event originates in a real-life context outside the data or within the data itself.
Some examples of visual social media analysis are shown in Schreck and Keim [SK13]. Using screenshots of the different visualizations, the authors explain the underlying data, analysis methods, and functionality of various applications in visual social media analysis.
There exists a survey on semantic sensemaking by Bontcheva and Rout [BR12]. Though their focus was on the semantic aspects, a subsection refers to visualization approaches.
Rohrdantz et al. [ROKF11] mention tasks for the “Real-Time Visualization of Streaming Text Data”. The tasks they name that are relevant to the scope of this paper are “monitoring”, “change and trend detection”, and “situational awareness”.
In the first step of the pipeline, the documents are prepared for the analysis. In this step the documents are parsed to obtain the plain texts, and standard text preprocessing methods, such as sentence detection, tokenizing, stemming, and lemmatizing, are applied. In addition to these standard methods, methods from computational linguistics can be used in the preprocessing step to annotate the texts with additional information. For instance, part-of-speech tagging, named entity extraction, or syntactic parsing can be used to identify word types, persons and places, or the structure of sentences.
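As a concrete illustration, the following is a minimal preprocessing sketch using NLTK; the library choice and the example sentence are our own assumptions, not something prescribed by the surveyed systems.

```python
# A minimal preprocessing sketch, assuming NLTK (pip install nltk) plus its
# tokenizer, POS tagger, and WordNet resources (fetched via nltk.download).
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "A strong earthquake struck the coast. Thousands of messages followed."

sentences = nltk.sent_tokenize(text)  # sentence detection
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)               # tokenizing
    stems = [stemmer.stem(t) for t in tokens]           # stemming
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # lemmatizing
    tags = nltk.pos_tag(tokens)                         # part-of-speech tagging
    print(tags)
```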
After the preprocessing step different approaches are used to detect events (see two branches in Processing in Figure 1).
The first group of approaches applies automatic methods to detect patterns in the data. The detected patterns are then used to create a visual analysis interface for the data set, which we call visual analysis. The interaction between the visualization and the automatic part shapes a visual analytics approach.
The second group of approaches skips the automatic analysis and directly visualizes the outcome of the preprocessing, which also remains mere visual analysis, because interaction possibilities are lacking to a certain extent.
Text Data Sources
1. News is a well-known text data source. News captures information about a real-world event or happening. It consists of a title, often followed by a short summary, and a body containing details about the event. News goes through a professional gatekeeping process, which in the end forms the agenda of the media.
2. A typical electronic document is email. ... Emails are used for personal conversations, advertisement, or business information exchange. They consist of a header and a body. The header contains information about the transaction: sender, receiver, timestamp, and other metadata. The body contains the textual content of the email. An email body can be of arbitrary length, which is one of its characteristics.
3. Weblogs, blogs for short, are used to inform a more or less undefined audience. ... A blog can have a specific topic or can be open for various topics.
4. RSS feeds are a standardized format to broadcast short news snippets. They consist of a title and a description. RSS feeds can be used by news agencies, newspapers and blogs. ... The standardized format allows the easy integration into other applications.
5. Recently, microblogging providers have become more and more popular. The messages are limited in length to 140 characters. So-called “hashtags” are used to mark the membership of a tweet in a certain topic. In addition, more metadata is provided, e.g., geolocation, author, place, etc.
6. User forums often have a hierarchical structure. A message within a forum is a post and is not strictly restricted in length. Posts that belong to the same topic form a so-called thread. In turn, several threads often belong to a sub-forum within the main forum. The purpose of a forum is the discussion of specific issues and topics related to its main topic.
7. Modern customer-care systems often ask each customer to fill out a feedback form after a purchase. This form (often digital, reachable through the internet) gives the customer the opportunity to report issues directly to the vendor. The information is a valuable source that allows the seller to react quickly and adequately to issues raised by customers. ... Often these forms are semi-structured, which means they have checkboxes for predefined questions and provide a free-text field for further comments.
8. Images and video sequences can be uploaded on sharing sites such as Flickr (https://www.flickr.com/). Users can tag their content with text. These tags and little text snippets typically describe the content in a short manner or express an emotional state being associated with the photo.
Almost half of the papers use microblogging data, namely Twitter. Table 1 shows that a shift towards microblogging happened in 2010. It is also noticeable that metadata (geolocations, author information) is often used in conjunction with microblogs.
Text Processing Methods
1. Part-of-speech (POS) tagging detects the word type of tokens.
2. Syntactic parsing determines the grammatical structure of sentences. ... Full syntactic parsing uses grammars and builds up a complete parse tree for a sentence. ... Shallow parsing creates meaningful chunks and avoids the complexity of full parsing.
3. Typed-dependency parsing determines the type of relations between words in a sentence.
4. Coreference resolution creates connections between referring expressions, such as pronouns, and their subjects in a text. A correct resolution of referring expressions could improve text mining results; e.g., polarity extraction would benefit from correctly resolved referring expressions.
5. Named entity recognition (NER) detects and labels names of, e.g., persons, locations, events, or dates in texts.
6. Polarity extraction determines the attitude (positive vs. negative) of the writer towards a subject (a small sketch follows this list).
7. Word-sense disambiguation techniques use the context of words to determine the correct sense of tokens. ... We only observed one paper using word sense disambiguation.
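To make polarity extraction concrete, here is a minimal sketch using NLTK's VADER sentiment model; the choice of VADER and the example posts are our own assumptions, since the surveyed papers use a variety of polarity methods.

```python
# A minimal polarity-extraction sketch, assuming NLTK's VADER model and its
# 'vader_lexicon' resource (nltk.download('vader_lexicon')).
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
posts = ["The rescue effort was amazing!", "Power is still out, this is awful."]
for post in posts:
    scores = analyzer.polarity_scores(post)  # neg/neu/pos parts plus 'compound'
    label = "positive" if scores["compound"] > 0 else "negative"
    print(label, scores)
```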
We confirm that text processing methods are used very sparingly. ... From 2000 until the end of 2011, only 17 out of 33 papers utilized any of the methods. The 16 papers with no text processing methods solved the event detection tasks with visualization. In 2012 the trend changed dramatically; 14 out of the 18 papers published since then have used text processing methods.
It is also noticeable that part-of-speech tagging and polarity extraction have gained popularity since 2012 as well.
Thus, we believe that many research papers have started to absorb more natural language processing techniques to derive their event metrics.
Automatic Methods for Text Event Detection
1. Clusters are generated for different time windows based on different properties of the documents, e.g., co-occurrence of terms, frequency in time, or metadata. Events are generated when the set of clusters changes, e.g., a new cluster arises or two existing clusters merge.
2. Users provide a set of example documents and classifiers learn to detect the annotated events. Classifier-based techniques are used in similar cases as rule-based ones, but have the advantage that users do not need to create rules themselves.
3. Statistical methods such as correlation or the detection of outliers and significant differences are used to identify events. Correlation-based methods examine the correlation between terms or between terms and time and detect events through changes in the correlation measures. A different type of statistical method calculates term-wise deviations from an expected value or uses other measures to identify rare or unique occurrences of terms (see the burst-detection sketch after this list).
4. Prediction-based methods predict the occurrence of following documents based upon past history.
5. Methods based on ontologies [HHSW09] are suitable for event analysis in single domains. Specific ontologies are generated with fully or semi-automatic methods. ... Using this type of method, events can then be detected from changes in the activated concepts.
6. Pattern mining algorithms, such as the A-priori algorithm of Wu and Chen [WC09] applied to text in [WSJ∗14], are used to extract common sequential patterns in document streams. Patterns can be found based on the documents themselves or on time intervals. In both cases, features extracted from the documents are then used to define patterns.
7. Models of the recurring characteristics of data streams can be used to detect events. An event is detected when a stream deviates significantly from its expected characteristics.
8. Rule-based approaches detect events with manually created rules. For instance, users specify rules based on terms and/or frequency to detect a particular event.
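As an illustration of the statistics-based type (item 3 above), the following is a minimal burst-detection sketch: a term in a time window becomes an event candidate when its count deviates strongly from its history. The per-window counts and the z-score threshold are illustrative assumptions, not taken from any surveyed system.

```python
# A minimal statistics-based event detection sketch: flag a term whose count
# in the current time window deviates strongly (z-score) from its history.
from collections import Counter
import statistics

# Term counts per time window (e.g., per hour); illustrative assumption.
windows = [
    Counter({"traffic": 4, "rain": 2}),
    Counter({"traffic": 5, "rain": 3}),
    Counter({"traffic": 4, "earthquake": 1}),
    Counter({"traffic": 5, "earthquake": 40}),  # burst in window 3
]

vocabulary = {term for window in windows for term in window}
history = {term: [] for term in vocabulary}  # term -> past per-window counts

for t, counts in enumerate(windows):
    for term in vocabulary:
        count = counts.get(term, 0)
        past = history[term]
        if len(past) >= 2:
            mean = statistics.mean(past)
            sd = statistics.stdev(past) or 1.0  # guard against zero deviation
            z = (count - mean) / sd
            if z > 3.0:  # deviation threshold; illustrative assumption
                print(f"window {t}: possible event around '{term}' (z={z:.1f})")
        past.append(count)
```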
Visualization of Events in Text Data
In total, 21 papers use a time-oriented visualization (river, timeline, circular) to visualize the evolution of the data over time. Time-oriented visualizations are often combined with additional visualizations to show non-time dependent information.
Map visualizations came up with microblogging data and mainly use the geographic references in the meta information of the microblogs.
A problem for all visualizations is the question of how to visually represent text data. This problem is usually solved by selecting meaningful keywords that are generated either by frequency or by another scoring technique such as topic models.
Basic visualizations (e.g., line or bar charts) are mainly used to give an overview of the data set by showing the time-dependent relations of events. They are used to visualize the data volumes or frequencies over time, for instance, of detected topics, named entities, or keywords.
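As a small illustration of such a basic visualization, here is a sketch that plots keyword frequency per day as a line chart with matplotlib; the keywords and counts are invented for the example.

```python
# A minimal line-chart sketch of keyword volume over time, assuming
# matplotlib; the data is an illustrative assumption.
import matplotlib.pyplot as plt

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
mentions = {"earthquake": [1, 0, 2, 38, 12], "election": [5, 6, 4, 5, 7]}

for keyword, counts in mentions.items():
    plt.plot(days, counts, marker="o", label=keyword)  # one line per keyword
plt.xlabel("day")
plt.ylabel("documents mentioning keyword")
plt.legend()
plt.show()
```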
Timeline visualizations use one or multiple timelines and place glyphs or shapes on these timelines to indicate single data items, densities, or volumes. Timeline visualizations are therefore preferred over river visualizations when single or rare items should be tracked, because a river visualization puts the focus on high-frequency events.
River metaphors are often used to visualize the outcomes of cluster algorithms. Although timeline techniques could be used, rivers provide a space-saving overview and give a better visual impression of the distribution of the clusters and the overall amount of data.
Supported Analysis Tasks
1. Overview visualizations give users a summary of the document collection. Common are textual summaries, based on frequent terms or topic models, that describe the topics found in the collection. These summaries serve as navigation support and are often used as a starting point for further analysis.
2. It is also common to provide users with the ability to search for keywords or to filter the data by meta information (a small filtering sketch follows this list). Both tasks reduce the number of items or extracted events in the analysis or visualization.
3. Monitoring tasks are the second most frequent tasks supported by the surveyed systems. Users monitoring a data source are interested in the evolution of events in a changing data source. Time-based visualizations (e.g., timeline, river, circular) are often used for monitoring tasks, because they show the temporal development of events in data sources.
4. In many cases relations between different detected events are visualized. The most frequently shown relations are relations in content, time, and volume. For instance, a river visualization shows time and volume relations between different streams, and with additional annotations relations in content can be shown as well.
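The following is a minimal sketch of task 2, keyword search combined with metadata filtering; the document structure and field names are our own illustrative assumptions.

```python
# A minimal keyword search and metadata filtering sketch; the document
# fields ('text', 'author', 'date') are illustrative assumptions.
from datetime import date

documents = [
    {"text": "Flooding reported downtown", "author": "alice", "date": date(2014, 5, 2)},
    {"text": "Concert tonight!", "author": "bob", "date": date(2014, 5, 2)},
    {"text": "Flooding recedes after storm", "author": "carol", "date": date(2014, 5, 4)},
]

def filter_documents(docs, keyword=None, since=None, author=None):
    """Keep only documents matching the keyword and metadata constraints."""
    if keyword:
        docs = [d for d in docs if keyword.lower() in d["text"].lower()]
    if since:
        docs = [d for d in docs if d["date"] >= since]
    if author:
        docs = [d for d in docs if d["author"] == author]
    return docs

print(filter_documents(documents, keyword="flooding", since=date(2014, 5, 3)))
```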
The statistical methods are combined with all types of visualizations.
Interestingly, clustering methods are often visualized by river visualizations.
Exceptionally, topic modeling techniques are used not only with time-dependent visualizations but also with other visualizations such as treemaps or geographic visualizations. This pattern appears because topic models are clustering methods that return a ranked list of terms representing single topics; these lists are often used in visualizations to label data and to find names for clusters.
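To illustrate how such ranked term lists arise, here is a minimal topic-modeling sketch using scikit-learn's LDA implementation; the corpus and the number of topics are illustrative assumptions.

```python
# A minimal topic-modeling sketch with scikit-learn's LDA, producing the
# ranked term lists used to label clusters; corpus is an illustrative assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "earthquake shakes city buildings damaged",
    "rescue teams search damaged buildings",
    "election results announced tonight",
    "candidates debate before election",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-3:][::-1]]  # top 3 terms per topic
    print(f"topic {i}: {top}")
```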
Evaluation
We subdivide qualitative methods into the following categories: case study, usability evaluation, use case, and anecdotal evaluation.
Table 7 accentuates the popular usage of use cases; in all but 16 of the considered papers the authors make use of this method. Typically, a use case validates through the description of a fictitious scenario that pinpoints the main features, whereas a case study involves a domain expert and is therefore more time-consuming [DNKS10, MBB∗11].
Anecdotal evaluations describe how the suggested system could be used, but do not provide sufficient evidence to judge the general efficacy of the presented technique.
Usability evaluations involve users performing particular tasks with the given system and ask for comments on usability.
The most prominent quantitative evaluation methods are comparisons of the detected events with a ground truth set. Often, event databases that the authors enrich with missing entries are used as ground truth.
A different form of evaluating algorithms is the comparison with existing algorithms and the reporting of quality measures. In some cases, not the results of the algorithms are evaluated but their performance in the sense of runtime or memory consumption, which is important for systems working in near-real-time scenarios.
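A minimal sketch of such a ground truth comparison follows, computing precision, recall, and F1 over detected event identifiers; the identifiers are invented, and real evaluations would match events by time and content rather than by exact IDs.

```python
# A minimal quantitative evaluation sketch: precision, recall, and F1 of
# detected events against a ground truth set; identifiers are illustrative
# assumptions.
detected = {"quake-0504", "concert-0502", "storm-0504"}
ground_truth = {"quake-0504", "storm-0504", "blackout-0505"}

true_positives = detected & ground_truth
precision = len(true_positives) / len(detected)
recall = len(true_positives) / len(ground_truth)
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```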
We also found only four papers using a user study for evaluation. We expected more papers to use user studies, because many systems present novel visualization techniques, and user studies can verify the strengths and weaknesses of an application [HHN00, LYK∗12, RHD∗12].
One thing we noticed is that data sources have changed dramatically from news to social media since 2010. Mainly due to the rise of social media, many research studies used text data streams generated on Facebook or Twitter.
Some data sources, such as discussion forums, are more underused than others. Discussion forums are a traditional means of collecting opinions from many people, but few studies investigate this data, because forums are asynchronous and slow to build up by nature. Despite these limitations, they also have a strength: an archival history of topics. Several discussion forums include years of textual conversation between multiple users on a single topic. Such a longitudinal conversation can, for instance, be used to detect noticeable shifts in a specific user group's opinions on political issues over months or years.
More importantly, visualizations were primarily used for presentation, with no interaction possible to steer the underlying data processing algorithms and analyze the data from a different angle. This limitation can prevent users from feeding their insights back into the visualizations.
Especially for news, news criteria (also known as news values) [GR65, HO01] can help find and develop new features for content-based event detection. They are mentioned only once in the whole body of surveyed news analysis research papers [DNKS10].
According to [GR65], news criteria are: frequency, threshold, unambiguity, meaningfulness, consonance, unexpectedness, continuity, composition, reference to elite nations, reference to elite people, reference to persons, and reference to something negative.
In general, news and selection criteria could be merged into one concept that we call event values. Event values include both the text data producer's and the user's perspectives. They could be implemented in the data analysis process by means of new features (feature engineering) and interactive elements, which goes along with the call for more visual analytics functionality.