Thursday, December 19, 2013

Polanco, X., Francois, C., Lamirel, J. (2001). Using artificial neural networks for mapping of science and technology: A multi-self-organizing-maps approach. Scientometrics, 51(1), 267-292.

information visualization/self-organizing map

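This paper proposes a knowledge interface for document topics based on multi-SOM. Each SOM represents the terms of one way of describing the documents (for example, author names in the papers, or terms related to research methods), and the mapping relations between the SOMs link the multi-SOM together to give researchers a reference during analysis. The paper discusses two properties of the SOM: (1) the distance relationships among documents in the high-dimensional space are preserved as far as possible when mapped onto the two-dimensional plane, and (2) terms that occur in more documents occupy more nodes (that is, a larger area) on the map. It also recommends a procedure for using SOMs as an analytical reference: (1) check the plausibility of the topics once the SOM has been generated, (2) interpret the information that the SOM display conveys, and (3) link the different SOM displays together. The study examines 1843 plant-related patent documents, indexed by 724 terms.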

According to Kohonen (1997, p. 86), one might say that the SOM is a non-linear projection of the probability density function p(x) of the high-dimensional input data vector x onto the two-dimensional display (i.e., a map). ... The self-organizing map (SOM) gives central attention to spatial order in the clustering of data. The purpose is to compress information by forming reduced representations of the most relevant features, without loss of information about their interrelationships.
In the quantitative studies of science, the Kohonen self-organizing maps have been used for mapping scientific journal networks (Campanario, 1995), and also author cocitation data (White et al., 1998).
In the SOM, competitive learning also means that a number of nodes compare the same input data with their internal parameters, and the node with the best match (say, the “winner”) then tunes itself to that input. In addition, the best matching node activates its topographical neighbours in the network to take part in tuning to the same input. The more distant a node is from the winning node, the weaker the learning.
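The winner selection and neighbourhood tuning described above can be sketched as a single training step. This is a minimal illustration, not the paper's implementation: the Gaussian neighbourhood function, the parameter names `learning_rate` and `radius`, and the dictionary layout of the map are all assumptions made for the example.

```python
import math

def best_matching_node(weights, x):
    """Return the grid position (row, col) of the node whose weight vector is closest to x."""
    best, best_dist = None, float("inf")
    for pos, w in weights.items():
        dist = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
        if dist < best_dist:
            best, best_dist = pos, dist
    return best

def train_step(weights, x, learning_rate=0.5, radius=1.0):
    """Tune the winner and its grid neighbours toward x (Gaussian neighbourhood is an assumption)."""
    winner = best_matching_node(weights, x)
    for pos, w in weights.items():
        # Squared distance on the 2D grid, not in the input space.
        grid_dist_sq = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
        # Influence is 1 at the winner and decays with grid distance.
        influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
        weights[pos] = [wi + learning_rate * influence * (xi - wi)
                        for wi, xi in zip(w, x)]
    return winner

# A toy 2x2 map with 2-dimensional weight vectors.
weights = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
           (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
winner = train_step(weights, [0.9, 0.9])
```

In this toy run the node at (1, 1) wins and moves halfway toward the input, while the node at (0, 0), the farthest on the grid, moves much less.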
Like any unsupervised clustering method, the SOM can be used to find clusters in the input data, and to identify an unknown data vector with one of the clusters. Moreover, the SOM represents the results of its clustering process in an ordered two-dimensional space (R^2). A mapping from a high-dimensional data space R^n onto a two-dimensional lattice of nodes is thus defined. Such a mapping can effectively be used to visualise metric ordering relations of input data.
The SOM takes a set of documents (in our case, patents) as input data (x). Each patent is represented by an N-keywords vector (x ∈ R^n), and the SOM maps these vectors onto nodes of a two-dimensional grid (n ∈ R^2).
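As a small illustration of the input representation, a patent can be turned into a keyword vector over a fixed vocabulary. The binary weighting and the example keywords below are assumptions for the sketch; the excerpt does not say how the vector components are weighted.

```python
def keyword_vector(patent_keywords, vocabulary):
    """Return one component per vocabulary term: 1.0 if the patent is indexed by it, else 0.0."""
    present = set(patent_keywords)
    return [1.0 if term in present else 0.0 for term in vocabulary]

# Hypothetical three-term vocabulary; the study itself uses 724 keywords.
vocabulary = ["maize", "promoter", "resistance"]
x = keyword_vector(["resistance", "maize"], vocabulary)
```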
The main properties of such self-organizing maps are the following: “First, the distance relationships between the input data are preserved by their images in the map as faithfully as possible. While some distortion is unavoidable, the mapping preserves the most important neighbourhood relationships between the data items, i.e., the topology of their distribution. Second, the map allocates different numbers of nodes to inputs based on their occurrence frequencies. If different input vectors appear with different frequencies, the more frequent one will be mapped to larger domains at the expense of the less frequent ones” (Ritter and Kohonen, 1989, p. 246).
The algorithm is based on three computational levels: the winning node selection, the unsupervised learning and neighbouring definition, and the control mechanisms of the unsupervised self-organizing algorithm.
Due to the fact that there is obviously no absolute strategy for achieving that goal, the choice has been to implement two different kinds of strategies that could be used interchangeably during the map interactive consultation phase. They are respectively called the cluster vector driven strategy and the document vector driven strategy. ... The cluster vector driven strategy consists of attributing to each cluster a name that represents the combination of the labels of the components having the maximum values in its vector. This strategy is well suited to highlighting for the user the main themes described by the map. ... The document vector driven strategy consists of attributing to each cluster a name that represents the combination of the labels of the components having the maximum values in either the vector of the most representative member of the cluster or the average document vector computed from all the cluster member vectors.
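The two naming strategies can be sketched as follows. This is an illustration only: the function names, the choice of `k` top components, and the example keywords are assumptions, and the document-driven version shows just the average-vector variant, not the "most representative member" variant.

```python
def top_labels(vector, keywords, k=2):
    """Return the keywords of the k components with the largest values."""
    ranked = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
    return [keywords[i] for i in ranked[:k]]

def label_from_cluster_vector(cluster_vector, keywords, k=2):
    """Name a cluster from the strongest components of its own (node) vector."""
    return top_labels(cluster_vector, keywords, k)

def label_from_documents(member_vectors, keywords, k=2):
    """Name a cluster from the average of its member document vectors."""
    n = len(member_vectors)
    average = [sum(column) / n for column in zip(*member_vectors)]
    return top_labels(average, keywords, k)

keywords = ["maize", "resistance", "promoter"]  # hypothetical index terms
name_a = label_from_cluster_vector([0.1, 0.8, 0.6], keywords)
name_b = label_from_documents([[1, 0, 0], [1, 1, 0], [1, 1, 1]], keywords)
```

The two strategies can give different names to the same cluster, since the node vector and the average of the member vectors need not rank the keywords in the same order.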
In order to reach that goal, the task that the system performs is to reduce the number of clusters (i.e., the number of nodes) of the map in a coherent way. The method consists in starting from the original map and introducing new clustering levels of synthesis (i.e., maps) by progressively reducing the number of nodes. Since the original map has been built on the basis of a 2D square neighbourhood between nodes, the transition from one level to another is achieved by choosing a new node set in which each new node will represent the average composition of a square of four direct neighbours on the original level. ... This procedure has the advantage of preserving the original neighbourhood structure on the newly generated levels. Moreover, it ensures the conservation of the topographic properties of the map node vectors, and consequently the conservation of the closeness of the node areas in the generalized maps.
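A possible reading of this generalization step is sketched below. The excerpt does not say whether the squares of four neighbours overlap; the sketch assumes overlapping squares, so an r x c map becomes an (r-1) x (c-1) map at the next level. The function name and the nested-list layout are also assumptions.

```python
def generalize_map(grid):
    """Build the next level: each new node averages a 2x2 square of neighbouring node vectors.

    grid[r][c] is the weight vector of the node at row r, column c.
    Overlapping squares are an assumption of this sketch.
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = []
    for r in range(rows - 1):
        new_row = []
        for c in range(cols - 1):
            square = [grid[r][c], grid[r][c + 1],
                      grid[r + 1][c], grid[r + 1][c + 1]]
            # Component-wise average of the four node vectors.
            new_row.append([sum(values) / 4 for values in zip(*square)])
        new_grid.append(new_row)
    return new_grid

# A toy 3x3 map with one-dimensional node vectors.
level0 = [[[0.0], [2.0], [4.0]],
          [[2.0], [4.0], [6.0]],
          [[4.0], [6.0], [8.0]]]
level1 = generalize_map(level0)
```

Because each new node is built from nodes that were already adjacent, neighbouring nodes on the new level share parts of their source squares, which is one way to see why the neighbourhood structure carries over.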
Our empirical example is a set of 1843 patents about vegetal transgenic technology indexed by 724 keywords, and recorded in the period 1978–1997.
In comparison with the standard mapping methods, such as principal component analysis or multidimensional scaling, the advantage of the multi-map displays is the inter-map communication mechanism that the Multi-SOM environment provides to the user. Each map represents a viewpoint. Each viewpoint represents a subject category. The inter-map communication mechanism assists the user in crossing information between the different viewpoints.
On the computer system side, the first task was to build the maps representing the different viewpoints, using the map algorithm. The second was to use the inter-map communication for achieving thematic querying.
The system provides to human analysts a first level analysis with its unsupervised learning approach for extracting from the data the features that the maps display. The second level of analysis is constituted by the tasks that the human analysts should achieve. These tasks can be organized in three successive stages.
First stage: Validation of the Plant Map. Verifying if the clusters of an area really represent the plant instead of the usual host of a pathogen. Examining the cluster positions on the map; Observing the cluster vectors; Considering the relative size of the associated areas.
Second stage: Interpretation of the results. This relies heavily on the background knowledge of the domain experts. ... The domain expert also detected ambiguous clusters on the map.
Third stage: Thematic queries using inter-map communication process. The activation of the whole area associated to a plant or a plant group. The use of the inter-map communication of the activity applying a possibilistic parameter along with a bias from the activated documents to the clusters of the target maps. The analysis of all the activated clusters on each of the target maps (i.e., target viewpoint).
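The idea of passing activity from one map to another through the shared documents can be sketched very roughly as below. This is a simplified stand-in, not the paper's mechanism: the possibilistic parameter and the bias mentioned above are not defined in this excerpt and are not modelled, and the activity of a target cluster is simply taken as the share of its documents that belong to an activated source cluster. All names and data are hypothetical.

```python
from collections import defaultdict

def propagate_activity(source_assignment, target_assignment, active_source_clusters):
    """Transfer activity from activated source clusters to target clusters via shared documents.

    Each assignment maps a document id to its cluster on that map.
    Returns, for every target cluster, the fraction of its documents that are activated.
    """
    active_docs = {doc for doc, cluster in source_assignment.items()
                   if cluster in active_source_clusters}
    totals = defaultdict(int)
    hits = defaultdict(int)
    for doc, cluster in target_assignment.items():
        totals[cluster] += 1
        if doc in active_docs:
            hits[cluster] += 1
    return {cluster: hits[cluster] / totals[cluster] for cluster in totals}

# Hypothetical patents placed on a plant map (source) and on another viewpoint (target).
plant_map = {"p1": "A", "p2": "A", "p3": "B"}
other_map = {"p1": "X", "p2": "Y", "p3": "Y"}
activity = propagate_activity(plant_map, other_map, {"A"})
```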
Two goals could be achieved at least by the graphic user interface. The first is the working interface for elaborated clean and final results. The second is the visualisation of the clean and final results with browsing and querying functionalities.
The model that this multi-map environment provides is certainly the map, but in its original extended version of intercommunication between multiple maps. Each map represents a particular viewpoint extracted from the data. These viewpoints are related either by the problem to be solved, or by the intercommunication mechanism between the maps. We have exposed both the map generation and their intercommunication mechanism. We finally showed how this clustering and mapping environment gives assistance to users in some watching intention.
A reason to use ANNs in quantitative studies of science and technology is their capability to create “higher abstractions from raw data completely automatically. Intelligence in neural networks ensues from abstractions, not from heuristic rules or manual logic programming” (Kohonen, 1997, p. 65).
The maps play the role of strategic indicators because they provide a means of comparison for evaluating the relative position of themes in an ordered space.
