information visualization
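This paper argues that three needs must be met: handling large volumes of text, doing so quickly, and presenting results vividly and understandably. It therefore proposes applying text mining and bibliometric indicators to large text databases to produce a family of technology maps and innovation indicators. The techniques used to produce the maps include multidimensional scaling (MDS), which places data items at suitable positions on the map, and a path-erasing algorithm, which links the nodes of related items. The paper also proposes using term co-occurrence information to assess how strongly data items are related.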
Three factors could enhance managerial utilization: capability to exploit huge volumes of available information, ways to do so very quickly, and informative representations that help manage emerging technologies.
Empirical analysis of emerging technologies poses a number of challenges to analysts. In particular, we note the need to:
1. digest enormous amounts of available information,
2. do so rapidly,
3. present findings vividly and understandably.
A third hard-earned lesson gained from our developmental experiences with ‘‘bibliometrics’’ (counting bibliographic activity) and ‘‘text mining’’ has been that TF-related results must be easily understood and must directly relate to a user’s perceived information needs.
This paper reports on efforts to address these three factors via partially automated processes to generate helpful knowledge from text quickly and graphically. We first illustrate a process to generate a family of technology maps that help convey emphases, players, and patterns in the development of a target technology. Second, we exemplify the generation of particular ‘‘innovation indicators’’ that measure particular facets of R&D activity to relate these to technological maturation, contextual influences, and market potential.
In sum, then, we seek to respond to these challenges—analyzing large text resources, rapidly, to generate compelling findings—to enhance TF (including competitive technological intelligence, technology foresight, etc.). Our approach, called technology opportunities analysis (TOA), seeks to facilitate this process by profiling search sets of bibliographic abstracts on technologies of interest.
The TOA process entails these main steps:
1. Search and retrieve text information, typically from large abstract databases.
2. Profile the resulting search set. VantagePoint applies a combination of machine learning, statistics, and natural language processing to yield what van Raan (1992) calls a mix of ‘‘one-dimensional’’ descriptions (lists) and ‘‘two-dimensional’’ relationships (matrices). Profiling may focus on documents, or it may focus on concepts (e.g., principal components analysis (PCA) to group related terms as conceptual clusters). A third choice is a combination that seeks to link documents to concepts.
3. Extract latent relationships. VantagePoint applies iterative principal components analyses to uncover links among terms and underlying concepts.
4. Represent relationships graphically. The generation of ‘‘mapping’’ and ‘‘indicators’’ is elaborated in the following sections.
5. Interpret the prospects for successful technological development. This typically entails integrating the bibliographic search set analyses with expert domain knowledge (interviews).
We have developed a partly automated process to do so based on ‘‘co-occurrence’’ information. Co-occurrence is based on the pattern of terms occurring together in the records. If two terms occur together in the records more frequently than expected, there is a presumption of relationship between them. Terms can include authorship (also organizational affiliation, nationality) or ‘‘keywords’’ (subject index terms), or noun phrases generated from titles or abstracts using our natural language processing (NLP) routine (cf., Refs. [18,20]).
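The "more frequently than expected" test can be illustrated with a minimal sketch. It assumes that the expected count for a pair is the one obtained if the two terms occurred independently across records; the excerpt does not say which expectation the authors use, and the function name `cooccurrence_ratios` and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_ratios(records):
    """Return {(term_a, term_b): (observed, expected, ratio)} for term pairs.

    `records` is a list of term collections, one per bibliographic record.
    Assumption: expected counts treat the two terms as independent.
    """
    n = len(records)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in records:
        unique = sorted(set(terms))  # count each term once per record
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    result = {}
    for (a, b), observed in pair_counts.items():
        expected = term_counts[a] * term_counts[b] / n
        result[(a, b)] = (observed, expected, observed / expected)
    return result


# Hypothetical records, each reduced to a set of keywords.
records = [
    {"nanotube", "carbon", "synthesis"},
    {"nanotube", "carbon"},
    {"quantum dot", "synthesis"},
    {"quantum dot", "laser"},
]
ratios = cooccurrence_ratios(records)
```

A ratio above 1 means the pair appears together more often than independence would predict, which is the presumption of relationship described above. Pairs that never co-occur are simply absent from the result.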
Effective visualization of the basic co-occurrence and correlation matrix information entails a sequence of analyses:
1) a new two-step multidimensional scaling (MDS) algorithm,
2) an improved path-erasing algorithm,
3) a routine to determine and display size (relative frequency of occurrence),
4) macros to create maps in VantagePoint, Microsoft Word or MS PowerPoint,
5) a routine to consolidate duplicate principal components (in the mapping process),
6) an algorithm to automatically name principal components,
7) an algorithm to cut off principal components to just include high-loading terms (the last three steps are needed for principal components maps; cf., Refs. [16,18]),
Our routine generates various maps, such as:
1. principal components map [represents the relationships among conceptual clusters];
2. keywords map [represents the relationships among frequently occurring subject index terms, title phrases, or whatever terms are chosen];
3. affiliations map [represents the relationships of affiliations’ research topics, based on terms they use in their documents—see Fig. 1];
4. authors map [analogous to affiliations map, but for individual researchers];
5. countries map [analogous to affiliations map];
6. sources (e.g., journals) map [analogous to affiliations map].
Fig. 1 shows an affiliations (organizations) map for the ‘‘Nanotechnology’’ topic. Displayed are the most prolific publishers abstracted in INSPEC for 1998. Along with the organizational name are shown the three keywords most frequently used in its publications in the search set. The size of a node reflects the number of publications. Positioning is determined using our MDS and path-erasing algorithm.
In essence, the challenge is to reduce an n-dimensional representation to 2-D or 3-D (here n is about 40, since the similarities among some 40 affiliations are being represented). MDS is the generally favored approach to accomplish this. In MDS, a quantity called stress, which measures how poorly the distances on the map reproduce the original similarities, guides the procedure. The process of generating an MDS map seeks the optimum location for each element in the map by minimizing the stress. ... We have devised a ‘‘step-by-step’’ search algorithm. This algorithm is effective at finding the global stress minimum, although it usually consumes more CPU time than the ‘‘steepest descent’’ algorithm.
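To make stress minimization concrete, here is a minimal sketch. It is not the authors' ‘‘step-by-step’’ algorithm, whose details are not given in this excerpt. It assumes one common normalized form of stress and uses a simple coordinate search that keeps a trial move only when stress drops, so it finds a local minimum with no guarantee of reaching the global one. The names `stress` and `coordinate_search` and all parameter values are hypothetical.

```python
import math
import random


def stress(coords, target):
    """Normalized stress between map distances and target dissimilarities."""
    num = den = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            num += (d - target[i][j]) ** 2
            den += target[i][j] ** 2
    return math.sqrt(num / den)


def coordinate_search(target, step=0.5, min_step=1e-4, seed=0, max_sweeps=10_000):
    """Place points in 2-D by trial moves that are kept only if stress drops."""
    rng = random.Random(seed)
    n = len(target)
    coords = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    best = stress(coords, target)
    for _ in range(max_sweeps):
        if step <= min_step:
            break
        improved = False
        for i in range(n):
            for axis in (0, 1):
                for delta in (step, -step):
                    coords[i][axis] += delta
                    s = stress(coords, target)
                    if s < best:
                        best, improved = s, True
                    else:
                        coords[i][axis] -= delta  # undo the trial move
        if not improved:
            step /= 2  # refine the step once no single move helps
    return coords, best


# Three items that should all sit at distance 1 from each other.
target = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
coords, final_stress = coordinate_search(target)
```

Each call to `stress` costs on the order of n² distance evaluations, and the search calls it for every trial move. That gives a rough sense of why an exhaustive stepwise search can cost more CPU time than a gradient-based method.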
Therefore, we have added an additional representational element, connecting links, based on a ‘‘path-erasing’’ algorithm. This is built on a proximity matrix among the elements. Its logic is as follows:
1. connect all elements in the proximity matrix together,
2. set a series of thresholds to erase the connecting lines one by one,
3. devise a suitable stop criterion.
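The three steps above can be sketched as follows. The excerpt does not specify the stop criterion, so this sketch assumes one: erasing stops as soon as removing the next-weakest link would leave some element unreachable. The function names and the sample proximity matrix are hypothetical.

```python
def is_connected(n, edges):
    """Check by depth-first search whether all n nodes are reachable from node 0."""
    neighbours = {i: set() for i in range(n)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, stack = {0}, [0]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == n


def erase_paths(proximity):
    """Erase the weakest links first; stop before the map would fall apart."""
    n = len(proximity)
    # Step 1: connect every pair of elements.
    edges = {(i, j) for i in range(n) for j in range(i + 1, n)}
    # Step 2: each link's proximity serves as a threshold, lowest first.
    for i, j in sorted(edges, key=lambda e: proximity[e[0]][e[1]]):
        remaining = edges - {(i, j)}
        # Step 3 (assumed stop criterion): keep the graph connected.
        if not is_connected(n, remaining):
            break
        edges = remaining
    return sorted(edges)


# Hypothetical symmetric proximity matrix for four elements.
proximity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.2, 0.8, 1.0, 0.7],
    [0.1, 0.3, 0.7, 1.0],
]
links = erase_paths(proximity)
```

With this matrix the three weakest links are erased and the chain 0-1-2-3 remains, since erasing the link between elements 2 and 3 would isolate element 3. Other stop criteria, such as a fixed proximity cutoff or a target number of links, would fit the same loop.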
The partially automated processes presented provide ‘‘value-added’’ knowledge from bibliographic text mining. The family of maps allows a user to gain an intuitive feel for R&D activity.
We suggest that developing routines to generate particular representations (technology maps and innovation indicators) automatically can enhance the applicability of text mining and bibliometrics to TF. ... However, scripting the production of these visualizations can facilitate provision of empirically based, vivid TF findings, in a timely manner, to inform decision making. That could dramatically increase the utilization of TF in management of technology.