White, H. D., Lin, X., Buzydlowski, J. W. and Chen, C. (2004). User-controlled mapping of significant literatures. Proceedings of the National Academy of Sciences of the United States of America, 101, 5297-5302.
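This study uses two dimension-reduction techniques, pathfinder networks (PFNETs) and self-organizing maps (SOMs), as graphical interfaces for retrieving PNAS journal articles. Given a term or an author as input, the system produces a map of the query together with its 24 most closely related terms or authors. PFNETs and SOMs generated from the query Gene Frequency and from Slatkin, a leading author on this topic, were shown to Slatkin, who found the resulting maps easy to interpret. The study also describes and compares the characteristics of PFNETs and SOMs as information visualization interfaces.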
information visualization
Our data are the contents of PNAS for 1971–2002, as described by medical subject headings from the National Library of Medicine (NLM) and by citation indexing from the Institute for Scientific Information (ISI).
SOMs show frequently co-occurring terms as nodes that are spatially close. PFNETs show them as nodes with explicit ties. The two kinds of maps will be exemplified here with medical subject headings (MeSH) and cocited authors in a specialty of genetics.
In their extensive review, Börner et al. (5) emphasize that “painting a big picture” is a main goal in domain mapping. This may lead to a strategy of mapping very large co-occurrence matrices in their entirety. Indeed, system designers have made many significant developments in software for such global portrayals of literatures, e.g., THEMESCAPE and VXINSIGHT render literatures as landscapes; GALAXIES and STARRYNIGHT render them as astral bodies (10–12).
In global mapping, system designers present the user with a preformed view, often in 3D, of some sizeable literature. Within the panel of visualization, landscapes invite flyovers; star-fields or other constructs invite flythroughs. In the former, peaks representing major accretions of documents on some subject are likely to exert a powerful pull on the user; in the latter, document points coded as important, e.g., by differences in shape, size, or color, exert a similar pull.
Essentially, the user is engaged in old-fashioned browsing, as of book titles in library stacks, but system designers may minimize or even eliminate labeling of objects in the map because labels clutter precious screen space and block the metaphorical presentation (see examples in ref. 12).
The user explores the view by “visiting” or “homing in on” objects of interest, rather as in video games, but typically cannot remap the literature in pursuit of some new interest because a new map takes hours of computer time to create.
Ours, however, is an alternative way of visualizing knowledge domains, the localized mapping. Perhaps the chief difference is that the localized approach relinquishes scope to increase the user’s control of the mapping process.
... our localized system of mapping more closely resembles online searching. The user starts the process by entering a single term at a web interface. This is consistent with the way most people search the web (13) and is intended to minimize cognitive demands on users. The system responds to the entry (or “seed”) term by forming a list of the terms that co-occur with it, ranked high to low by frequency. The seed term and its 24 next-highest neighbors are then exhibited as a PFNET or a SOM, which the user can switch between.
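The seed-term step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `top_neighbors`, the representation of each document as a set of indexing terms, the alphabetical tie-break, and the toy records are all hypothetical and are not taken from the paper or from PNASLINK.

```python
from collections import Counter


def top_neighbors(documents, seed, k=24):
    """Return the k terms that co-occur most often with `seed`.

    `documents` is an iterable of term collections, one per document
    (an assumed data layout, not the paper's actual storage format).
    """
    counts = Counter()
    for terms in documents:
        unique = set(terms)
        if seed in unique:
            # Count every other term that shares a document with the seed.
            counts.update(unique - {seed})
    # Rank high to low by frequency; ties are broken alphabetically
    # so that the result is deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


# Toy records (made up for illustration).
docs = [
    {"Gene Frequency", "Alleles", "Models, Genetic"},
    {"Gene Frequency", "Alleles"},
    {"Gene Frequency", "Polymorphism, Genetic"},
    {"Alleles", "Mutation"},
]
neighbors = top_neighbors(docs, "Gene Frequency", k=2)
print(neighbors)
```

With `k=24`, the seed plus the returned terms would give the 25 nodes that the excerpt says are passed on to the PFNET or SOM display.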
If the indexing terms used in the mapping are indeed controlled by a formal thesaurus, our SOMs and PFNETs provide an alternative: they display the top listings in what is sometimes called a term’s associative thesaurus (2).
A map of cocited authors is, in effect, an associative thesaurus of authors linked by conjoint use of their works. Again, these linkages may permit useful retrievals that are not otherwise possible (1).
PFNETs and SOMs are dimension-reduction techniques that have been used to visualize the structure of literatures for more than a decade.
In the context of the movement joining bibliometrics with document retrieval (2, 5, 10), PFNETs have been described by Fowler and colleagues (15–17), McGreevy (18), and Chen (19, 20). Analogous accounts of SOMs have been done by Lin et al. (21), Roussinov and Chen (22), and Chen et al. (23).
The number of links in a PFNET is controlled by two parameters, r and q.
The parameter r, which determines how path weights are computed, is lucidly explained by Fowler et al. (17): “Path weight, r, is computed according to the Minkowski r-metric. It is the rth root of the sum of each distance raised to the rth power for all links in a path between two nodes. Although the r-metric is continuously variable, simple interpretations exist only for r = 1 (path weight is the sum of the link weights in the path), r = 2 (path weight is the Euclidean distance), and r = infinity (path weight equals the maximum link weight in the path). One advantage of r = infinity is that one need only assume that the original distance estimates have ordinal properties. Another advantage is that the link structure will be preserved for any monotonic transformation of the data.”
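The three interpretations in the quotation can be checked with a small Python sketch of the Minkowski r-metric. The function name `path_weight` and the example link weights are assumptions made for illustration only.

```python
import math


def path_weight(link_weights, r):
    """Minkowski r-metric weight of a path, given the weights of its links."""
    if math.isinf(r):
        # Limit case r = infinity: the largest link weight on the path.
        return max(link_weights)
    # rth root of the sum of each link weight raised to the rth power.
    return sum(w ** r for w in link_weights) ** (1.0 / r)


links = [3.0, 4.0]                      # a two-link path (made-up weights)
print(path_weight(links, 1))            # sum of the link weights
print(path_weight(links, 2))            # Euclidean combination
print(path_weight(links, math.inf))     # maximum link weight
```

Because the r = infinity case depends only on which link is largest, any order-preserving rescaling of the weights leaves the comparison between paths unchanged, which is the ordinal property the quotation refers to.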
The parameter q sets the maximum length, in links, of the paths examined in the test of the triangle inequality (24); a link is removed if it violates the inequality, that is, if some alternative path of at most q links has a lower weight than the link itself. The larger the value of q, the more extensively the triangle inequality is enforced, so more links are likely to be found in violation and removed. If q is one less than the number of nodes, then all of the potential violators are under scrutiny.
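A minimal sketch of this pruning for the case q = n − 1 follows, assuming a symmetric matrix of distances (co-occurrence counts would first have to be converted to distances, a step not shown here). The function name `pfnet` and the toy matrix are hypothetical; the all-pairs update is the standard Floyd-Warshall scheme with the Minkowski combination in place of addition, not necessarily the procedure the authors used.

```python
import math


def pfnet(dist, r=math.inf):
    """Links kept in PFNET(r, q = n - 1) for a symmetric distance matrix."""
    n = len(dist)

    def combine(a, b):
        # Weight of two sub-paths joined end to end under the r-metric.
        if math.isinf(r):
            return max(a, b)
        return (a ** r + b ** r) ** (1.0 / r)

    # best[i][j] becomes the minimum path weight over paths of any length.
    best = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if k == i or k == j:
                    continue
                via = combine(best[i][k], best[k][j])
                if via < best[i][j]:
                    best[i][j] = via

    # A direct link survives only if no indirect path is strictly lighter.
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if dist[i][j] <= best[i][j]]


# Toy distances among three nodes (made up for illustration).
d = [[0.0, 1.0, 3.0],
     [1.0, 0.0, 2.0],
     [3.0, 2.0, 0.0]]
print(pfnet(d))         # r = infinity: path 0-1-2 weighs 2 < 3, so link (0, 2) is dropped
print(pfnet(d, r=1))    # r = 1: path 0-1-2 weighs 3, not lighter, so all links stay
```

The example shows why r = infinity with q = n − 1 gives the sparsest network: the path weight is only the largest link on the path, so indirect paths beat direct links more often.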
The more frequently co-occurring terms, which presumably have greater mutual relevance, occupy more proximate regions on the map. SOMs are designed to render not just the highest co-occurrence counts between terms, but rather relatively high co-occurrences across groups of terms.
They are a softer-focus kind of mapping than PFNETs, but they, too, suggest specific combinations of terms on which the user might want to base retrievals.
This process of self-organization (also known as unsupervised learning) runs over many iterative cycles. In each iteration, the images of term pairs that are strongly related in the high-dimensional space will be moved closer on the lower-dimensional space until stability is reached.
A row from the cooccurrence matrix “is randomly selected and compared to every output node to determine a winner. Weights of the winning output nodes then are updated so that the next time this input node is presented, this output node will likely be selected again as the winner. In the meantime, nodes surrounding the winning node are similarly adjusted.
The number of iterations needed to train a SOM is often determined empirically (in our case, we optimize the number of training cycles to 2,500).
After the training, input vectors closest in the input space will map to the same regions in the output map. The regions are delineated by areas of nodes in which the elements with the highest value on the vectors are the same.”
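The training loop described in the quotation can be sketched in plain Python as follows. Only the 2,500 training cycles come from the excerpt; the 4 x 4 grid, the linearly decaying learning rate and neighbourhood radius, the Gaussian neighbourhood function, the function names `train_som` and `region_labels`, and the toy matrix are all assumptions chosen to keep the example small.

```python
import math
import random


def train_som(rows, width=4, height=4, cycles=2500, seed=0):
    """Train a small SOM on the rows of a co-occurrence matrix."""
    rng = random.Random(seed)
    dim = len(rows[0])
    # One randomly initialised weight vector per output node on the grid.
    weights = {(x, y): [rng.random() for _ in range(dim)]
               for x in range(width) for y in range(height)}

    def winner(vec):
        # Output node whose weight vector is closest (squared Euclidean) to vec.
        return min(weights, key=lambda node: sum(
            (w - v) ** 2 for w, v in zip(weights[node], vec)))

    for t in range(cycles):
        remaining = 1.0 - t / cycles
        rate = 0.5 * remaining                              # assumed schedule
        radius = 0.5 + 0.5 * max(width, height) * remaining  # assumed schedule
        vec = rng.choice(rows)           # a randomly selected input row
        bx, by = winner(vec)
        for (x, y), w in weights.items():
            # The winner and the nodes around it are pulled toward the input.
            grid_d2 = (x - bx) ** 2 + (y - by) ** 2
            pull = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
            for i in range(dim):
                w[i] += pull * (vec[i] - w[i])
    return weights, winner


def region_labels(weights, terms):
    """Label each node with the term holding the highest value in its vector."""
    return {node: terms[max(range(len(w)), key=w.__getitem__)]
            for node, w in weights.items()}


# Toy normalised co-occurrence rows (made up for illustration).
terms = ["Gene Frequency", "Alleles", "Mutation"]
rows = [[1.0, 0.7, 0.1],
        [0.7, 1.0, 0.1],
        [0.1, 0.1, 1.0]]
weights, winner = train_som(rows)
labels = region_labels(weights, terms)
for y in range(4):
    print([labels[(x, y)] for x in range(4)])
```

Contiguous nodes that share a label form the regions mentioned in the quotation; a term that labels many nodes occupies a large area of the map.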
Adjacent areas reflect stronger relationships than nonadjacent areas. Terms in large areas are more influential than terms in small areas.
Slatkin found his own cocited author maps readily interpretable. He was acquainted with every name that appears in Fig. 2. In the PFNET (which he again preferred), he identified the main structural feature, the clusters around himself and Masatoshi Nei, as representing two slightly different subject areas. Both the Nei group and the Slatkin group, he said, have contributed to the literature on genetic flow and population structure, but the Slatkin group has contributed relatively more to the literature on microsatellites (short, repetitive sequences of DNA). Hence, the PFNET was picking up a division he found meaningful.
Interestingly, at the lower left the SOM conjoins Wright, Mayr, and Fisher, who represent the older, pioneering generation in statistical genetics. The SOM algorithm is able to bring this out solely on the basis of their overall cocitation profiles.
If PFNETs seem directive about term relationships, SOMs are merely suggestive. However, their greater ambiguity is perhaps a virtue.
Using AUTHORLINK, the forerunner of PNASLINK, Buzydlowski (9) found that SOMs outperformed PFNETs in capturing the mental models of 20 experts in selected fields of the humanities. ... The experts’ mental models were elicited by having them sort cards bearing authors’ names into intuitively meaningful piles. ... SOMs agreed with the card-sort data better than PFNETs. In the Plato trial, both SOMs and PFNETs were highly correlated with the pooled card-sort data (SOMs, r = 0.97; PFNETs, r = 0.78), but these correlations were significantly different at P < 0.001. In the individual-author trials, a t test of mean agreement scores favored SOMs significantly at P < 0.01.