Thursday, December 19, 2013

Polanco, X., Francois, C., Lamirel, J. (2001). Using artificial neural networks for mapping of science and technology: A multi-self-organizing-maps approach. Scientometrics, 51(1), 267-292.

information visualization/self-organizing map

This paper proposes a multi-SOM-based interface for exploring the topical knowledge in a document collection: each SOM presents one type of term data describing the documents (for example, terms related to author names or to research methods), and the maps are chained together through inter-map correspondences to support researchers during analysis. The paper discusses two properties of the SOM: (1) the distance relationships among documents in the high-dimensional space are preserved as faithfully as possible when mapped onto the two-dimensional plane, and (2) terms occurring in more documents occupy more nodes (i.e., a larger area) on the map. It also recommends a procedure for using SOMs as an analytic aid: (1) validate the plausibility of the themes after a map is generated, (2) interpret the information the map displays, and (3) chain the different SOMs together. The empirical study examines 1,843 plant-related patent documents indexed by 724 terms.

According to Kohonen (1997, p. 86), one might say that the SOM is a non-linear projection of the probability density function p(x) of the high-dimensional input data vector x onto the two-dimensional display (i.e., a map). ... The self-organizing map (SOM) gives central attention to spatial order in the clustering of data. The purpose is to compress information by forming reduced representations of the most relevant features, without loss of information about their interrelationships.
In quantitative studies of science, Kohonen self-organizing maps have been used for mapping scientific journal networks (Campanario, 1995) and author cocitation data (White et al., 1998).
In the SOM, competitive learning means that a number of nodes compare the same input data with their internal parameters, and the node with the best match (the "winner") then tunes itself to that input; in addition, the best-matching node activates its topographical neighbours in the network to take part in tuning to the same input. The more distant a node is from the winning node, the weaker the learning.
Like any unsupervised clustering method, the SOM can be used to find clusters in the input data, and to identify an unknown data vector with one of the clusters. Moreover, the SOM represents the results of its clustering process in an ordered two-dimensional space (R^2). A mapping from a high-dimensional data space R^n onto a two-dimensional lattice of nodes is thus defined. Such a mapping can effectively be used to visualise metric ordering relations of input data.
The SOM takes a set of documents (in our case, patents) as input data x, each patent represented by an N-keyword vector (x ∈ R^N), and maps them onto the nodes of a two-dimensional grid (in R^2).
The main properties of such self-organizing maps are the following: "First, the distance relationships between the input data are preserved by their images in the map as faithfully as possible. While some distortion is unavoidable, the mapping preserves the most important neighbourhood relationships between the data items, i.e., the topology of their distribution. Second, the map allocates different numbers of nodes to inputs based on their occurrence frequencies. If different input vectors appear with different frequencies, the more frequent ones will be mapped to larger domains at the expense of the less frequent ones" (Ritter and Kohonen, 1989, p. 246).
The algorithm is based on three computational levels: the winning node selection, the unsupervised learning and neighbourhood definition, and the control mechanisms of the unsupervised self-organizing algorithm.
Due to the fact that there is obviously no absolute strategy for achieving that goal, the choice has been to implement two different kinds of strategies that can be used interchangeably during the interactive map consultation phase. They are respectively called the cluster vector driven strategy and the document vector driven strategy. ... The cluster vector driven strategy consists of attributing to each cluster a name that represents the combination of the labels of the components having the maximum values in its vector. This strategy is well suited to highlighting for the user the main themes described by the map. ... The document vector driven strategy consists of attributing to each cluster a name that represents the combination of the labels of the components having the maximum values in either the vector of the most representative member of the cluster or the average document vector computed from all the cluster member vectors.
In order to reach that goal, the task the system performs is to reduce the number of clusters (i.e., the number of nodes) of the map in a coherent way. The method consists of starting from the original map and introducing new clustering levels of synthesis (i.e., maps) by progressively reducing the number of nodes. Since the original map has been built on the basis of a 2D square neighbourhood between nodes, the transition from one level to another is achieved by choosing a new node set in which each new node represents the average composition of a square of four direct neighbours on the original level. ... This procedure has the advantage of preserving the original neighbourhood structure on the newly generated levels. Moreover, it ensures the conservation of the topographic properties of the map node vectors, and consequently the closeness of the node areas in the generalized maps.
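A minimal Python sketch of this generalization step, under the assumption that a map level is stored as a (rows, cols, dim) array of node vectors (names and shapes here are illustrative, not from the paper):

import numpy as np

def generalize_map(node_vectors):
    """Average each 2x2 square of direct neighbours into one node of the next level."""
    rows, cols, dim = node_vectors.shape
    trimmed = node_vectors[:rows - rows % 2, :cols - cols % 2, :]
    blocks = trimmed.reshape(rows // 2, 2, cols // 2, 2, dim)
    return blocks.mean(axis=(1, 3))   # (rows//2, cols//2, dim)

# e.g., a 10 x 14 map of 724-dimensional keyword vectors becomes a 5 x 7 map
level1 = generalize_map(np.random.rand(10, 14, 724))

Each pass halves both grid dimensions while keeping each new vector the average composition of four direct neighbours, which is what preserves the neighbourhood structure across levels.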
Our empirical example is a set of 1843 patents about vegetal transgenic technology indexed by 724 keywords, and recorded in the period 1978–1997.
In comparison with standard mapping methods such as principal component analysis or multidimensional scaling, the advantage of the multi-map display is the inter-map communication mechanism that the Multi-SOM environment provides to the user. Each map represents a viewpoint, and each viewpoint represents a subject category. The inter-map communication mechanism assists the user in crossing information between the different viewpoints.
On the computer system side, the first task was to build the maps representing the different viewpoints, using the map algorithm. The second was to use inter-map communication to achieve thematic querying.
The system provides human analysts with a first level of analysis through its unsupervised learning approach, extracting from the data the features that the maps display. The second level of analysis consists of the tasks that the human analysts themselves should carry out. These tasks can be organized in three successive stages.
First stage: Validation of the Plant Map. This involves verifying whether the clusters of an area really represent the plant rather than the usual host of a pathogen, by examining the cluster positions on the map, observing the cluster vectors, and considering the relative size of the associated areas.
Second stage: Interpretation of the results. This relies heavily on the background knowledge of the domain experts. ... The domain expert also detected ambiguous clusters on the map.
Third stage: Thematic queries using the inter-map communication process. This involves activating the whole area associated with a plant or plant group; using inter-map communication to propagate the activity, with a possibilistic parameter and a bias, from the activated documents to the clusters of the target maps; and analysing all the activated clusters on each of the target maps (i.e., target viewpoints).
At least two goals could be achieved by the graphic user interface. The first is a working interface for elaborating clean, final results. The second is the visualisation of those clean, final results with browsing and querying functionalities.
The model that this multi-map environment provides is certainly the map, but in an extended version with intercommunication between multiple maps. Each map represents a particular viewpoint extracted from the data. These viewpoints are related either by the problem to be solved or by the intercommunication mechanism between the maps. We have presented both the map generation and the intercommunication mechanism, and finally showed how this clustering and mapping environment assists users in their watch (monitoring) tasks.
A reason to use ANNs in quantitative studies of science and technology is their capability to create "higher abstractions from raw data completely automatically. Intelligence in neural networks ensues from abstractions, not from heuristic rules or manual logic programming" (Kohonen, 1997, p. 65).
The maps play the role of strategic indicators because they provide a way of comparing the relative positions of themes in an ordered space.

Campanario, J. M. (1995). Using neural networks to study networks of scientific journals. Scientometrics, 33(1), 23-40.

information visualization/self-organizing map
This study uses self-organizing maps to display the network structure of journal cross-citations. The author argues that, owing to the mathematical formalism of the Kohonen map, the most closely linked journals are mapped to nearby positions, and journals with many self-citations occupy larger areas on the map. Four journal citation networks are studied: 19 chemical physics journals, 20 communication journals, and two sociology data sets from different years (11 and 13 journals respectively), the latter pair used to examine whether the field changed between the two periods. Each journal is represented as an n-dimensional feature vector, where n is the number of journals in the data set and each component is the number of citations that journal gives to another journal.
If one paper cites an earlier publication, they bear a conceptual relationship. The references given in a publication link that publication to previous knowledge. In the network context, information is reconceptualized in terms of social linkages and shared meanings. According to Small and Garfield (1985), citation indexes, showing millions of interconnections annually among hundreds of thousands of scientific articles and books, seem ideally suited for deriving natural maps of the scientific landscape.
The results of studies carried out with the above methodologies have been used to identify science maps (Small, Sweeney, and Greenlee, 1985; Small, 1993), maps of disciplines (Garfield, 1986), research fronts (Dixon, 1989), scientific journal networks (Midorikawa, 1983; Pinski, 1977; Saito, 1990), epistemic and conceptual networks (Leydesdorff, 1991; van Raan and Tijssen, 1993), invisible colleges (Lievrouw, Rogers, Lowe, and Nadel, 1987) or author networks (McCain, 1986), and to establish the rank of journals in a given network (Doreian, 1987; Hummon and Doreian, 1989; Doreian, 1994; Bonitz, 1990).
Journals are a central institution of science because they are the primary formal channels for communicating theories, methods and empirical results to the readers of those journals (Rice, 1990).
Four sets of journal-to-journal citation data were used to apply Kohonen's map algorithm to the study of networks of scientific journals. ... Data are for citations among 19 chemical physics journals pooled for 1981 (data set I), 20 communication journals pooled from 1977 to 1985 (data set II), 11 sociology journals pooled for 1970 (data set III), and 13 sociology journals from 1975 to 1976 (data set IV). Data sets III and IV include data referring to the same journals (for different years), with some new journals added in data set IV. This makes it possible to compare results for two different years that involve almost the same journals.
To perform the computations, each journal was coded as an n-component vector (n representing the number of journals in a given set). Each component of the vector is the number of citations given by each journal to each other journal. The input vectors were normalized to allow the algorithm to normalize the weights.
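A hedged sketch of that coding in Python (the citation counts are invented for illustration; note the matrix is asymmetric, which is the hierarchy issue discussed below):

import numpy as np

# cites[i, j] = number of citations given by journal i to journal j
cites = np.array([[120.,  30.,   5.],
                  [ 25., 200.,  10.],
                  [  4.,  12.,  80.]])

# each row becomes a unit-length input vector for the SOM
inputs = cites / np.linalg.norm(cites, axis=1, keepdims=True)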
The figures show the distribution of relational space among the journals. Because of the mathematical formalism of the Kohonen maps, the most closely linked journals are located close to each other. In addition, domains occupied by the journals with a large number of self-citations tend to be greater.
Most of the multidimensional statistical methods use a symmetrical matrix of relations among cases to define a distance. However, the cross-citation matrix among journals is asymmetric, as noted earlier. This fact reflects the hierarchical structure of journal relationships: some journals are subordinate to others (Leydesdorff, 1986). This problem is sometimes overcome by computing some kind of correlation in order to obtain a symmetrical cross-citation matrix. However, this transformation causes the loss of the hierarchic quality of journal interrelations. This is manifested in the domain map in which, sometimes, a given journal activates more cells than other closely linked journals.

Guerrero Bote, V. P., de Moya Anegón, F. and Herrero Solana, V. (2002). Document organization using Kohonen's algorithm. Information Processing and Management, 38, 79-89.

information visualization/self-organizing map

This study uses the Self-Organizing Map to organize 202 abstracts from the LISA database drawn from eight descriptor classes (Acquisitions, Artificial Intelligence, Business Management, Computerized Information Storage and Retrieval, Conferences, Periodicals, WWW). In the results, neighboring nodes mostly hold abstracts sharing the same descriptor, and relationships can also be found between adjacent regions; for example, on the two resulting maps the region occupied by Computerized Information Storage and Retrieval is adjacent to the regions for both Artificial Intelligence and WWW.

Kohonen's model is capable of performing a topological organization of the inputs presented to it.
This type of network has recently been used in documentation for the analysis of domains (White, Lin, & McCain, 1998), for textual data mining (Lagus, Honkela, Kaski, & Kohonen, 1999), to extract semantic relationships between words from their contexts (Honkela, Pulkki, & Kohonen, 1995; Ritter & Kohonen, 1989), and in particular to generate topological maps of sets of documents, even labeling the zones of influence of each word or term (Kohonen et al., 1999a; Kaski, 1999; Lagus & Kaski, 1999; Moya, Herrero, & Guerrero, 1998; Moya Anegón et al., 1999; Chen, Houston, Sewell, & Schatz, 1998; Lin, 1997; Guerrero Bote, 1997; Orwig, Chen, & Nunamaker, 1997; Lin, Soergel, & Marchionini, 1991).
In the learning process, as well as clustering the inputs, the Kohonen network generates a topological organization of those clusters. When we apply this to documentation, the result is the creation and organization of clusters in such a manner that those which are topically close will also be close in the network. We may use this to expand the query, or rather the results: once one has found the cluster that best fits the query, one may extend the activation to those which are topologically close.
In some of these cases, as well as performing a document classification, one determines for each node which unitary term vector produces the greatest activation. One may thereby generate each term's zone of influence, providing a graphical view of the database on which one could even select the zone one wants to visit.

Linton, J. D., Himel, M. and Embrechts M. J. (2009). Mapping the structure of research: Business and Management as an exemplar. Serials Review, 35, 218-227.

information visualization/self-organizing map

This paper studies 202 business and management journals. About 200 abstracts were collected for each journal; the occurrence counts of words and word pairs across all abstracts were tallied, and the 300 most frequent words and 163 most frequent word pairs were selected as abstract features. Feature vectors for the 202 journals were built, used to train a self-organizing map, and then mapped. On the resulting map, journals of related fields are mapped to the same or adjacent nodes, while journals mapped farther apart are less related. The authors argue that a map built this way can help new researchers choose journals, serve as a reference in promotion evaluation, illuminate the character of interdisciplinary research fields, and provide a better way of organizing large amounts of information.
The relationship of different journals to each other is of interest to management academics and librarians for a number of reasons:
1) New researchers (students and junior faculty) often find it difficult to determine which journals offer a good fit with their interests because they are new to the field.
2) Evaluation of promotion and tenure is often complicated by lack of common agreement and sufficient domain knowledge for assessing the relevance and quality of the journals a candidate has published in.
3) Better understanding of the interdisciplinary nature of a field and the relative distance and proximity of different subfields is essential.
4) Demonstration of techniques and methodology to better organize large amounts of information without having to find individuals with suitable domain knowledge is useful and does not risk selection biases that can result from reliance on individuals.
5) Selection/cancellation of journals in a collection depends on many metrics. ... These techniques can assist in identifying the fit of journals that are candidates for the addition or cancellation process in a serials collection.
Parameswaran and Sebastian (2006) indicate that the problems with objective ranking studies include the following:
1) bigger journals may have higher citation scores because there is greater potential for citation;
2) journals connected with professional associations have a large default subscriber base;
3) all references are not equally important;
4) authors tend to cite more from their own culture;
5) citations are counted whether they involve praise or criticism;
6) citation analysis does not capture influence that extends beyond academia;
7) some seminal works are so well known that authors no longer feel the need to reference them.
Method used in this study
202 business and management journals were selected from the "Business," "Business-Finance," and "Management" categories of the Social Sciences Citation Index, the "Operations Research and Management Science" category of the Science Citation Index, and the "Top 40" journal list selected by the Financial Times.
The 200 most recent abstracts were downloaded for each selected journal.
The 300 most frequently occurring words and 163 most frequently occurring word couplets, with the condition that words lacking specificity to business and management research were eliminated, were selected as a "dictionary" of common terms for the business and management field. Each journal is described by a vector giving the relative occurrence of the commonly occurring technical words and word pairs. Once this process was completed for all of the journals in the study, the set of vectors describing the journals was entered into a Kohonen self-organizing map (SOM) program.
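The paper does not give its extraction code; the following Python sketch shows one plausible way to build such a vector, assuming whitespace tokenization and a precompiled dictionary of words and adjacent-word couplets (all names here are illustrative):

from collections import Counter

def journal_vector(abstracts, dictionary):
    """Relative occurrence of each dictionary entry (word or 'w1 w2' couplet)."""
    counts, total = Counter(), 0
    for text in abstracts:
        tokens = text.lower().split()
        total += len(tokens)
        counts.update(t for t in tokens if t in dictionary)
        counts.update(c for c in (f"{a} {b}" for a, b in zip(tokens, tokens[1:]))
                      if c in dictionary)
    return [counts[term] / max(total, 1) for term in dictionary]

# dictionary would hold the 300 words plus 163 couplets; one vector per journal
vector = journal_vector(["supply chain risk management in practice"],
                        ["risk", "chain risk"])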
Results of the Mapping
The journals are all arranged based on the relative use of commonly used technical words. Journals that sit in the same cell are very similar to each other. Journals sitting in adjacent cells have clear similarities in the relative frequency of common technical words. While there are some similarities amongst journals that have a cell situated in between them, once journals are further apart than this, the degree of similarity declines rapidly.
Another area in which interesting insights are offered is into the structure of interdisciplinary research. In some cases, journals associated with different traditional disciplines sit in adjacent cells. This relationship suggests that the journals may offer a bridge between these fields.
The self-organizing map has substantial utility for management researchers and practitioners in the following ways:
1) This tool offers new researchers a way of accelerating the build up of domain knowledge regarding which journals are and are not a fit with their research interests and agenda.
2) The SOM can assist in bridging the lack of agreement and insufficient domain knowledge that adds great variability into the assessment of the relevance and quality of journals into the tenure and promotion process.
3) This approach can create a better understanding of the interdisciplinary nature of the field and the relative distance and proximity of different fields. The SOM can also be used to identify journals that have a role in acting as a bridge spanning two or more disciplines or subfields.
4) The process demonstrates techniques and methodology to better organize large amounts of information without having to find individuals with suitable domain knowledge, and thus without risking the selection biases that can result from relying on individuals.

Lin, X. (1997). Map displays for information retrieval. Journal of the American Society for Information Science, 48(1), 40-54.

information visualization/self-organizing map

This study analyzes the suitability of browsing as a mode of information retrieval and the various browsing-oriented visual organization formats.

Information retrieval includes both searching and browsing. For browsing to serve as a retrieval method, one must consider that: 1) the information items need a good organizational structure; 2) users need to explore items in collections unfamiliar to them; 3) users do not understand how the collection is organized and prefer a low-cognitive-load way of exploring it; 4) users have difficulty articulating their information needs; and 5) users can recognize the information they want but find it hard to describe.

To offer browsing in an information retrieval service, large numbers of information items must be organized visually, ideally preserving the structure of and relationships within the information so that users can exploit their visual capabilities effectively. The study therefore also compares the characteristics, advantages, and disadvantages of four visual organization formats: hierarchical, network, scatter, and map displays.
1) Hierarchical displays simplify complex data structures through levels, branches, and clusters; they can present both global and local views of the information and focus the viewer's attention at an appropriate level of generality. However, when the information itself is not hierarchical, a hierarchy tends to oversimplify it; hierarchies are also hard to generate and display for large information spaces; and users bear a heavier cognitive load when choosing which branch to browse next.
2) Network displays represent data items as nodes, and users browse by following the links between nodes to find related items. Compared with hierarchical displays, network displays can represent more complex structures and apply more broadly; for the same reason, however, users may find a complex network hard to comprehend.
3) Scatter displays use a mapping algorithm to project high-dimensional data onto a two-dimensional plane, and the projection should preserve the original distance relationships among the data as faithfully as possible. Scatter displays are well suited to showing the overall shape of the data and can reveal its semantic structure.
4) Map displays divide the plane into regions, each representing a possible topic (a group of related terms). The size of a region reflects the topic's importance (the more frequently a term occurs, the larger the region it occupies on the map), and the distance between regions reflects how related the topics are, with closer regions indicating more related topics. The author argues that map displays combine the advantages of the other three formats: the hierarchical clustering of hierarchical displays, the associative links of network displays, and the spatial mapping of scatter displays. Unlike the other three, the map display does not directly output the final projection of the documents onto the graphic; instead it produces an associative network.

The study uses three document collections as examples, with the SOM as the map-display method and different indexing (document feature) schemes to represent the documents in each collection.
(1) 311 papers on multilingual information retrieval, indexed by 85 terms occurring in the titles; each component of the feature vector is a binary value indicating whether the term occurs. The regions on the resulting map were found to represent paper topics; region size relates to the number of papers on a topic, with more frequent topics occupying larger areas, and adjacent regions indicate topics that have co-occurred in papers.
(2) 660 papers collected personally by a researcher, indexed by 1,472 terms from titles, keywords, and abstracts; each vector component is the term's frequency in the paper multiplied by its inverse document frequency. The regions on the map were found to represent the researcher's topics of interest, with larger regions indicating topics the researcher emphasizes more; mapping the papers onto their corresponding positions also reveals their distribution.
(3) 143 papers from the 1990-1993 SIGIR conference proceedings, indexed by 154 terms occurring in the titles; each vector component is the term's frequency across the paper's title, keywords, and abstract. The resulting map serves as an interactive retrieval interface: the number of terms shown on the map can be increased or decreased, and clicking any region retrieves the related papers.
Computers are expected to be used to reveal associations and properties of electronic information to allow people to use their visual capabilities for information seeking (Veith, 1988).
The map display attempts to show both contents and semantic structures of a document space by mapping major concepts and documents of a document space to a two-dimensional map. It preserves, as faithfully as possible, document semantic relationships and reveals these relationships through various visual components of the display. 
Most users have difficulty specifying their needs by a specific query formulation; even if users are successful in doing so, systems have difficulty retrieving all relevant documents without overwhelming the users with irrelevant documents. The issue of precision/recall has been a bottleneck for retrieval systems: Retrieving more relevant documents (high recall) is often at the price of getting more irrelevant documents (low precision).
Visual displays that show terms and document relationships and reveal underlying structures of the document space will be such browsing aids that will relax demands on the performance of retrieval mechanisms and query generations. Such displays will allow the user to interact and browse a large quantity of search results in a limited display space.
Browsing is a direct application of human perception for information seeking, both in the electronic and non-electronic environment (Chang & Rice, 1993). Browsing is explorative; it is an interactive process in which one will scan large amounts of information, perceive or discover information structures or relationships, and select information items through focusing one's visual attention.
In relation to information retrieval, browsing is particularly useful when:
1) There is a good organizational structure and related information items are often located near each other (Thompson & Croft, 1989).
2) Users are not familiar with the content of the collection and they need to explore the collection (Motro, 1986).
3) Users have less understanding of how information is organized in the system and they prefer to take a low cognitive load approach to explore the system (Marchionini, 1987).
4) Users have difficulties in articulating their information needs (Belkin, Oddy, & Brooks, 1982).
5) Users look for information that is easier to recognize than to describe (Bates, 1986).
Some techniques that researchers have explored to support browsing for information retrieval include:
1) displaying a dynamic hierarchical information structure (Frei & Jauslin, 1983),
2) providing an overview map of the information space (Halasz, Moran, & Trigg, 1987),
3) providing a neighborhood map for each item (Thompson & Croft, 1989),
4) showing both a miniature of the entire information space and a detailed local map (Beard & Walker, 1990),
5) distorting the display so that the center of focus will be shown in more detail than other areas—the fish-eye views (Furnas, 1986), and
6) supporting interactive functions such as zoom-in and zoom-out so that the user can select different levels of detail to display (Schatz & Caplinger, 1989).
A central issue of organizing information for visualization is what formats and features of visual displays will help to organize large amounts of information to reveal information structures and to support effective use of human visual capabilities.
Hierarchical displays simplify complex data structures and separate data into different levels, branches, or clusters. These functions help to represent both global and local views of data, to utilize the display screen effectively, and to direct the viewer’s attention to the appropriate level of generality.
Cutting, Karger, Pedersen, & Tukey (1992) showed that hierarchical clustering could be an effective information access tool, particularly for browsing.
These disadvantages of hierarchical displays include (1) oversimplification of structures for certain data, particularly for those that are more appropriate to be represented by structures other than a hierarchy, (2) difficulty in generating and displaying hierarchical displays for large information spaces, and (3) increased cognitive load for users who are forced to make selections among the hierarchical branches, especially when the whole hierarchy is not displayed on the screen.
Network displays show associative structures on the screen and let the viewer follow the links to browse items represented by the nodes. ... They can represent more general and complicated structures than hierarchical displays can. ... However, if all the relationships in a complex document space are displayed in a network, the network display simply becomes a network maze. The network displays thus often present more information than the user can immediately comprehend (Beard & Walker, 1990).
Scatter displays refer to the graphical (dotted) image resulting from mapping high-dimensional data to a two-dimensional visual space. ... Most of the scatter displays are generated automatically by mapping algorithms. Because the mapping is usually driven by an error-minimum process or by the principle of finding a display configuration whose overall layout most closely matches the structure of the given data, the mapping creates a spatial orientation that reflects the overall layout of underlying data.
Scatter displays are very useful in revealing underlying data structures of statistical data (Tufte, 1983). In particular, scatter displays can also be used to reveal semantic or intellectual structures embedded in statistical data.
Among the three display formats reviewed, scatter displays most faithfully reflect underlying data structures. In a scatter display, the viewer is not constrained to follow predetermined links as in the network display or to follow a rigid hierarchical structure in the hierarchical display. However, this lack of regularity in the scatter display also poses problems for the viewer trying to discover the underlying structure. In this respect, the scatter display particularly needs the help of other context or interactive probes such as verbal labeling or mouse sensitive areas.
Compared to the physical space, the document space is much less clearly defined in terms of its measurement, its dimensionality, and its semantic relationships, all of which largely depend on the selected indexing process. ... It would be difficult to have a map that is a ‘‘true’’ representation of the document space like the geographical map is for the physical space. ...  The map displays should also provide rich visual information, and be able to present dynamic displays at different detail levels to allow the user to interact with the underlying information.
The map display was designed to provide the advantages of mapping, linking, and clustering as in the scatter displays, network displays, and hierarchical displays reviewed earlier.
The mapping algorithm selected will keep the display structure as similar to the underlying data structure as possible.
With an appropriate indexing, Kohonen’s feature map algorithm can be used to ‘‘survey’’ contents of a document space, to ‘‘detect’’ semantic relationships of terms and documents, and to generate map displays that will show both contents and semantic relationships of documents.
Kohonen’s feature map algorithm takes a set of input objects, each represented by an N-dimensional vector, and maps them onto nodes of a two-dimensional grid.
The mapping procedure is a recursive learning process of the following:
1) Select an input vector randomly from the set of all input vectors,
2) find the node (which is also represented by an N-dimensional vector called weights) closest to the input vector in the N-dimensional space,
3) adjust weights of the node (called the winning node), so that it will more likely be selected again if this input is presented later,
4) adjust the weights of those nodes within a neighborhood of the winning node, so that nodes within this neighborhood will have similar weight patterns.
This process goes through many iterations until it converges, i.e., the adjustments all approach zero.
To ensure its convergence, two control mechanisms are imposed.
The first is the updating parameter. It approaches zero as the number of iterations increases.
The second is the neighborhood structure that shrinks gradually during the process. A large neighborhood will achieve ordering and a small neighborhood will help to achieve a stable convergence of the map (Kohonen, 1989) . By beginning with a large neighborhood and then gradually reducing it to a very small neighborhood, the feature map achieves both ordering and convergence properties.
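A compact Python sketch of this recursive process, with an illustrative linear decay for the updating parameter and a Gaussian neighborhood whose radius shrinks over time (the paper does not specify these schedules, so they are assumptions):

import numpy as np

def train_som(inputs, rows=10, cols=14, iterations=2000, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, inputs.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1)
    for t in range(iterations):
        x = inputs[rng.integers(len(inputs))]                  # 1) pick a random input
        winner = np.unravel_index(
            np.linalg.norm(weights - x, axis=2).argmin(),      # 2) best-matching node
            (rows, cols))
        alpha = 0.5 * (1 - t / iterations)                     # updating parameter -> 0
        radius = 1 + (max(rows, cols) / 2) * (1 - t / iterations)  # shrinking neighborhood
        d = np.linalg.norm(grid - np.array(winner), axis=2)
        h = np.exp(-(d ** 2) / (2 * radius ** 2))              # neighborhood influence
        weights += alpha * h[..., None] * (x - weights)        # 3) and 4) adjust weights
    return weights

# Example I below: 311 binary title vectors of 85 dimensions on a 10 x 14 grid
som = train_som(np.random.randint(0, 2, (311, 85)).astype(float))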
Early applications of the algorithm mostly demonstrated that the feature map could preserve metric relationships and the topology of input patterns.
I. A Map Display for a Retrieved Set of Documents
This example used a set of documents retrieved by a search done on the INSPEC database in DIALOG for the topic of multilingual information retrieval. The set contains 311 documents. The indexing for this document set was based on titles only. ... As a result, 85 terms were retained to index the document set. A vector of 85 dimensions was created for each document, where a component was a "1" if the corresponding term occurred in the document title and a "0" otherwise. The document vectors were used as input to train a feature map of 85 input features and 10 by 14 output nodes arranged in a grid.
Results:
The areas on the resulting map can be seen as concept areas (more precisely, word areas).
The size of the areas corresponds to the word occurrence frequencies.
The neighboring relationships of areas indicate frequencies of the co-occurrence of words represented by the areas.
II. A Map Display for a Personal Collection
The second example is a map display for a personal document collection. The collection contained 660 documents, which were accumulated over many years as a by-product of a researcher’s research activities. ... The indexing for this collection was fulltext-based—every word in the titles, keywords, and abstracts was used. After the stopword-removing and stemming procedures, and the elimination of the most-frequently and the least-frequently occurring terms, 1,472 terms remained in the indexing list. To create the indexing vectors, weights of each term were computed based on both the term frequency and the inverse document frequency. The 660 document vectors of 1,472 dimensions were then used as input to train a 10 by 14 Kohonen’s feature map of 1,472 input features.
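A hedged Python sketch of the tf-idf weighting described here, assuming a list of term-count dictionaries per document (illustrative names, not the paper's code):

import math
from collections import Counter

def tfidf_vectors(docs, vocabulary):
    """docs: list of Counter(term -> frequency); returns weight lists per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in doc if term in vocabulary)
    idf = {term: math.log(n / df[term]) for term in vocabulary if df[term]}
    return [[doc[term] * idf.get(term, 0.0) for term in vocabulary] for doc in docs]

# e.g., two tiny "documents" over a three-term vocabulary
vectors = tfidf_vectors([Counter({"retrieval": 2, "map": 1}), Counter({"map": 3})],
                        ["retrieval", "map", "browsing"])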
Results:
The map display generated shows the researcher’s major research areas and the relationships of these areas.
The size of areas, corresponding to the frequencies of the words, indicates the relative importance of the areas to the researcher (the more often a word appears in the personal collection, the more likely the word will correspond to a large area in the space).
The neighboring relationships, corresponding to the frequencies of co-occurring words, reflect degrees of word associations as derived from the researcher’s collection.
When each document in the collection was mapped to a position on the display, the document distribution over the map display can also be shown.
The map can also reveal migration of the researcher’s interest over time.
III. A Map Display for Conference Proceedings
The third example is about documents from 1990–1993 SIGIR conference proceedings. These proceedings contain 143 documents. The indexing terms for this collection were collected from titles only, but the weights of terms were computed based on the term frequency in titles, keywords, and abstracts. ... After the same stopword-removing, stemming procedures, and elimination of the most- and the least-frequently occurring terms, 154 terms were used to index the collection, resulting in 143 vectors of 154 dimensions. These vectors were then used to train a 14 by 14 feature map of 154 input features.
As a mapping tool, the feature map has the properties of economic representation of underlying data and their interrelationships (in these examples, the feature map self-organizes major terms selected from hundreds or thousands of indexing terms to represent the document spaces).
As a visualization tool, the feature map produces rich geographical features that can be used for visual inferences. The algorithm generates an associative network as the output, rather than the direct mapping of the input. This makes it easy to implement various interactive tools and provide different ‘‘views’’ of the underlying information.
Finally, that the algorithm allows classification of any input to more than one location is certainly beneficial to information retrieval.
A major challenge to the success of document mapping is how to evaluate map displays and how to compare different map displays. ... This result suggests that comparisons of map displays need to be done on how the map displays help the user locate documents, not just how they look. It is quite possible to have different organizations of map displays that can provide the same level of access to a document space.

Smith, K. A. and Ng, A. (2003). Web page clustering using a self-organizing map of user navigation patterns. Decision Support Systems, 35, 245-256.

information visualization/self-organizing map

This study uses the self-organizing map to cluster 235 web pages of the Monash University School of Business Systems. The log files stored on the web server serve as the reference data for training and mapping. The log is first divided, by user and time stamps, into 8,054 transactions; the K-means clustering method then groups the transactions into nine classes according to the pages contained in each transaction. Counting how often each of the 235 pages appears in each of the nine transaction classes yields the feature vector representing each page. On the resulting map, pages from the same directory are mapped to the same or neighboring nodes, showing that the self-organizing map together with user access data can be used to categorize and visualize related web pages.
For the system to accurately reflect the needs of users, the organization of the web documents should also take into account the feedback from users. While it is useful to have a system to organize the web pages in a content-driven manner, it may be more advantageous to organize the web pages in a web-user oriented manner. After all, the web documents are organized so that humans can search in a more effective and efficient manner.
The authors have developed the prototype of the LOGSOM system based on the access logs for September of 1999 from the Monash University, School of Business Systems web server. There are 170,515 entries in the web log indicating the date, time, and address of the requested web pages, as well as the IP address of the user’s machine. ... The original server logs are formatted, cleansed, and then grouped into meaningful transactions before being mapped onto the self-organizing map.
Following Cooley et al. (1999), the authors group the data into meaningful transactions. The authors define a transaction as a set of web pages requested by a user in a particular session. ... For the examined web log, the number of transactions m = 8054 and the number of URLs n = 235.
The dimensionality of the SOM input vectors would need to equal the number of transactions, and because this number is so large, training is not feasible with this data. ... By using the K-means clustering algorithm, we cluster the transactions into nine groups. The number K = 9 was chosen arbitrarily. ... Thus, after the dimension reduction, the data set consists of 235 URLs × 9 transaction groups.
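A hedged Python sketch of this reduction using scikit-learn's KMeans (the transaction-by-URL matrix here is random placeholder data, not the Monash log):

import numpy as np
from sklearn.cluster import KMeans

transactions = np.random.randint(0, 2, (8054, 235))  # transaction x URL incidence
labels = KMeans(n_clusters=9, n_init=10, random_state=0).fit_predict(transactions)

url_vectors = np.zeros((235, 9))
for label, row in zip(labels, transactions):
    url_vectors[:, label] += row          # URL occurrences per transaction group
# url_vectors (235 x 9) becomes the training input for the page SOM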
The distance between nodes on the resulting map indicates the similarity of the web pages, measured according to the user navigation patterns. LOGSOM provides a visual tool to enable users to see the relationship between web pages based on the usage patterns of web users similar to themselves. LOGSOM also provides an analysis tool for web masters and web authors to better understand the interests of visitors to their pages, and identify potential referring pages.

White, H. D., Lin, X., Buzydlowski, J. W. and Chen, C. (2004). User-controlled mapping of significant literatures. Proceedings of the National Academy of Science of the United States of America, 101, 5297-5302.

This study uses two dimension-reduction techniques, pathfinder networks (PFNETs) and self-organizing maps (SOMs), as graphical interfaces for retrieving PNAS journal articles: entering a term or an author produces a map of the query together with its 24 most related terms or authors. The PFNET and SOM generated for the query "gene frequency" and for Slatkin, a leading author on that topic, were shown to Slatkin, who found the resulting maps easy to interpret. The study also describes and compares the characteristics of PFNETs and SOMs as information visualization interfaces.

information visualization

Our data are the contents of PNAS for 1971–2002, as described by medical subject headings from the National Library of Medicine (NLM) and by citation indexing from the Institute for Scientific Information (ISI).
SOMs show frequently co-occurring terms as nodes that are spatially close. PFNETs show them as nodes with explicit ties. The two kinds of maps will be exemplified here with medical subject headings (MeSH) and cocited authors in a specialty of genetics.
In their extensive review, Börner et al. (5) emphasize that "painting a big picture" is a main goal in domain mapping. This may lead to a strategy of mapping very large co-occurrence matrices in their entirety. Indeed, system designers have made many significant developments in software for such global portrayals of literatures, e.g., THEMESCAPE and VXINSIGHT render literatures as landscapes; GALAXIES and STARRYNIGHT render them as astral bodies (10–12).
In global mapping, system designers present the user with a preformed view, often in 3D, of some sizeable literature. Within the panel of visualization, landscapes invite flyovers; star-fields or other constructs invite flythroughs. In the former, peaks representing major accretions of documents on some subject are likely to exert a powerful pull on the user; in the latter, document points coded as important, e.g., by differences in shape, size, or color, exert a similar pull.
Essentially, the user is engaged in old-fashioned browsing, as of book titles in library stacks, but system designers may minimize or even eliminate labeling of objects in the map because labels clutter precious screen space and block the metaphorical presentation (see examples in ref. 12).
The user explores the view by ‘‘visiting’’ or ‘‘homing in on’’ objects of interest, rather as in video games, but typically cannot remap the literature in pursuit of some new interest because a new map takes hours of computer time to create.
Ours, however, is an alternative way of visualizing knowledge domains, the localized mapping. Perhaps the chief difference is that the localized approach relinquishes scope to increase the user’s control of the mapping process.
... our localized system of mapping more closely resembles online searching. The user starts the process by entering a single term at a web interface. This is consistent with the way most people search the web (13) and is intended to minimize cognitive demands on users.  The system responds to the entry (or ‘‘seed’’) term by forming a list of the terms that co-occur with it, ranked high to low by frequency. The seed term and its 24 next-highest neighbors are then exhibited as a PFNET or a SOM, which the user can switch between.
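A minimal Python sketch of this seed-term step, assuming a precomputed symmetric co-occurrence matrix and a parallel term list (illustrative names only):

import numpy as np

def top_neighbors(seed, terms, cooc, k=24):
    """terms: list of strings; cooc[i, j]: co-occurrence count of terms i and j."""
    i = terms.index(seed)
    ranked = [terms[j] for j in np.argsort(-cooc[i]) if j != i]  # high to low
    return [seed] + ranked[:k]     # the 25 terms handed to the PFNET or SOM display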
If the indexing terms used in the mapping are indeed controlled by a formal thesaurus, our SOMs and PFNETs provide an alternative: they display the top listings in what is sometimes called a term’s associative thesaurus (2).
A map of cocited authors is, in effect, an associative thesaurus of authors linked by conjoint use of their works. Again, these linkages may permit useful retrievals that are not otherwise possible (1).
PFNETs and SOMs are dimension-reduction techniques that have been used to visualize the structure of literatures for more than a decade.
In the context of the movement joining bibliometrics with document retrieval (2, 5, 10), PFNETs have been described by Fowler and colleagues (15–17), McGreevy (18), and Chen (19, 20). Analogous accounts of SOMs have been done by Lin et al. (21), Roussinov and Chen (22), and Chen et al. (23).
The number of links in a PFNET is controlled by two parameters, r and q.
The parameter r, which determines how path weights are computed, is lucidly explained by Fowler et al. (17): "Path weight, r, is computed according to the Minkowski r-metric. It is the rth root of the sum of each distance raised to the rth power for all links in a path between two nodes. Although the r-metric is continuously variable, simple interpretations exist only for r = 1 (path weight is the sum of the link weights in the path), r = 2 (path weight is the Euclidean distance), and r = infinity (path weight equals the maximum link weight in the path). One advantage of r = infinity is that one need only assume that the original distance estimates have ordinal properties. Another advantage is that the link structure will be preserved for any monotonic transformation of the data."
The parameter q sets the range within which all paths of length up to q will be examined in the test of the triangle inequality (24) and removed if they violate it. The larger the value of q, the more extensive the triangle inequality constraint; therefore, more links are likely to be found violating the rule and removed. If q is one less than the number of nodes, then all of the potential violators are under scrutiny.
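A brute-force Python sketch of PFNET(r, q) pruning as just described: compute the Minkowski r-metric weight of every alternative path of at most q links, and drop a direct link whenever a cheaper indirect path exists. This illustrates the definition only; it is not an efficient published algorithm.

import itertools
import numpy as np

def path_weight(legs, r):
    legs = np.asarray(legs, dtype=float)
    return legs.max() if np.isinf(r) else (legs ** r).sum() ** (1.0 / r)

def pathfinder(dist, r=np.inf, q=2):
    """Keep a link only if no alternative path of <= q links has a smaller weight."""
    n = len(dist)
    keep = np.ones((n, n), dtype=bool)
    for i, j in itertools.combinations(range(n), 2):
        others = [m for m in range(n) if m not in (i, j)]
        for length in range(2, q + 1):
            for mids in itertools.permutations(others, length - 1):
                nodes = (i, *mids, j)
                legs = [dist[a][b] for a, b in zip(nodes, nodes[1:])]
                if path_weight(legs, r) < dist[i][j]:
                    keep[i, j] = keep[j, i] = False
    return keep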
The more frequently co-occurring terms, which presumably have greater mutual relevance, occupy more proximate regions on the map. SOMs are designed to render not just the highest co-occurrence counts between terms, but rather relatively high co-occurrences across groups of terms.
They are a softer-focus kind of mapping than PFNETs, but they, too, suggest specific combinations of terms on which the user might want to base retrievals.
This process of self-organization (also known as unsupervised learning) runs over many iterative cycles. In each iteration, the images of term pairs that are strongly related in the high-dimensional space will be moved closer on the lower-dimensional space until stability is reached.
A row from the cooccurrence matrix ‘‘is randomly selected and compared to every output node to determine a winner. Weights of the winning output nodes then are updated so that the next time this input node is presented, this output node will likely be selected again as the winner. In the meantime, nodes surrounding the winning node are similarly adjusted.
The number of iterations needed to train a SOM is often determined empirically (in our case, we optimize the number of training cycles to 2,500).
After the training, input vectors closest in the input space will map to the same regions in the output map. The regions are delineated by areas of nodes in which the elements with the highest value on the vectors are the same.’’
Adjacent areas reflect stronger relationships than nonadjacent areas. Terms in large areas are more influential than terms in small areas.
Slatkin found his own cocited author maps readily interpretable. He was acquainted with every name that appears in Fig. 2. In the PFNET (which he again preferred), he identified the main structural feature, the clusters around himself and Masatoshi Nei, as representing two slightly different subject areas. Both the Nei group and the Slatkin group, he said, have contributed to the literature on genetic flow and population structure, but the Slatkin group has contributed relatively more to the literature on microsatellites (short, repetitive sequences of DNA). Hence, the PFNET was picking up a division he found meaningful.
Interestingly, at the lower left the SOM conjoins Wright, Mayr, and Fisher, who represent the older, pioneering generation in statistical genetics. The SOM algorithm is able to bring this out solely on the basis of their overall cocitation profiles.
If PFNETs seem directive about term relationships, SOMs are merely suggestive. However, their greater ambiguity is perhaps a virtue.
Using AUTHORLINK, the forerunner of PNASLINK, Buzydlowski (9) found that SOMs outperformed PFNETs in capturing the mental models of 20 experts in selected fields of the humanities. ... The experts' mental models were elicited by having them sort cards bearing authors' names into intuitively meaningful piles. ... SOMs agreed with the card-sort data better than PFNETs. In the Plato trial, both SOMs and PFNETs were highly correlated with the pooled card-sort data (SOMs, r = 0.97; PFNETs, r = 0.78), but these correlations were significantly different at P < 0.001. In the individual-author trials, a t test of mean agreement scores favored SOMs significantly at P < 0.01.

White, H. D. and McCain, K. W. (1998). Visualizing a discipline: An author co-citation analysis of Information Science, 1972–1995. Journal of the American Society for Information Science, 49, 327-355.

vis_paper

This paper discusses author co-citation analysis (ACA) and applies it to information science. The study analyzes author co-citation data from 12 information science journals between 1972 and 1995, divided into three 8-year periods across the 24 years; the 100 most-cited authors were identified in each period, 120 authors in all, 75 of whom appear in all three periods. The methods and results are as follows:
1) The matrix of co-citation counts between the 120 authors and the other authors was converted to Pearson correlation coefficients, and factor analysis was performed by principal components analysis with varimax rotation to reveal the specialty structure of information science. The number of factors extracted was determined by eigenvalues greater than 1; each factor represents a specialty, and an author loading 0.3 or higher on a factor is taken to be generally regarded by citers as belonging to that specialty. Because an author may load above 0.3 on several factors, each author may have several specialties. Twelve factors were extracted, explaining 84% of the variance. The first eight, with the largest eigenvalues, can be identified from their authors as the following information science specialties: a) experimental retrieval, the design and evaluation of document retrieval systems; b) citation analysis, the study of the interconnections among scientific literatures; c) practical retrieval, retrieval applied to real-world databases; d) bibliometrics, the mathematical modeling of regularities in textual and bibliographic distributions; e) general library systems theory, covering library automation, library operations, and related topics; f) user theory, the study of information needs and uses; g) scientific communication, the study of the social system of science; and h) OPACs. The remaining factors consist of scholars from other fields whose work has been imported into information science. Based on the cross-loadings of authors across specialties and the maps described below, information science can be divided into two subdisciplines: the analytical study of learned literatures and their social contexts, and the study of the human-computer-literature interface.
2) The 120 authors' mean co-citation counts in the three periods were analyzed to gauge their standing and influence in each period.
3) The correlation coefficients derived from the authors' co-citation count matrices, i.e., their similarity as generally perceived by citers, served as the measure of relatedness between them. The multidimensional scaling technique ALSCAL mapped the top 100 authors of each period onto a graph, so that authors with similar co-citation profiles have nearby points. Complete linkage clustering with the CLUSTER routine then divided the authors into subdisciplines according to their relatedness. Authors belonging to the same specialty were found to lie close together on the maps, and, as similar earlier studies have noted, information science clearly splits into two subdisciplines, information retrieval and domain analysis. Comparing the maps across periods, a few author points move markedly, but most positions remain quite stable.
4) A map was produced from the changes in author positions across the three period maps, representing changes in the authors' citation images.
5) Using the co-citation correlations of the canonical authors across the three periods as input, INDSCAL was used to assess the importance of each dimension in each period, testing from a citation perspective whether the discipline underwent a paradigm shift. The second dimension, representing the human-computer-literature interface, varied in importance across the three periods much more than the first dimension, representing the subject specialties of information science: its weight was low in 1972-1979, increased sharply in 1980-1987, and decreased slightly in 1988-1995. Many researchers hold that information science underwent a paradigm shift in the 1980s, and White and McCain's result corroborates this view.

We defined the authors of information science as all those cited in 12 journals, as listed below. Authors were ranked in order of citedness for the entire period covered by Social Scisearch, 1972–1995. Co-citation data were retrieved for all pairs in the top-ranked 120, from which we produced:
1) A factor analysis of the 120 authors for the entire 24-year span, 1972–1995, which reveals the specialty structure of the discipline. Factor analysis, unlike multi-dimensional scaling and clustering, can show an author’s contribution to more than one specialty.
2) Analyses of the 120 authors’ mean co-citation counts, which indicate their standing and influence in the discipline as of 1972–1979, 1980–1987, 1988–1995, and at the end of the three periods combined.
3) Two-dimensional maps of the top 100 authors in each of the 8-year periods (made with ALSCAL, the SPSS multidimensional scaling program).
4) A map of authors whose ‘‘citation images’’ changed markedly over the years of our study.
5) A two-dimensional composite map of the authors who are in the top 100 in all three periods—some 75 in all. Their most cited works arguably make up the canonical literature of information science. Certain statistics generated by the mapping routine (INDSCAL, a part of ALSCAL) may bear on paradigm shift in the discipline.

In any field of scholarship, writers make judgments as to who has written on what, using what methods, and they reflect the judgments in their citing practices. Aggregated over time, these practices assume definite structure: Writers show commonalities in how they judge the subject matter, methodology, and intellectual style of other writers; for example, they often attach the same meanings and significance to precedent works (Cozzens, 1985; Small, 1978).

It suggests how authors are commonly viewed on two dimensions, often interpretable as subject matter and style of work. ... Author clusters placed on these two dimensions can be interpreted as specialties within a discipline (White, 1990a, 1990b).

What is actually mapped is an author’s citation image. Everyone ever cited has one, but only those who have been cited in many writings are likely to figure in ACA. In the latter case, the image has a constant part, the author’s identity as it is rendered in successive reference lists. The image also has a variable part, the gradually increasing set of other author-names that co-occur with a given author in those lists. At the end of a time period, ACA sums up the record by mapping the author as a single point among other selected author-points on the basis of the repeated co-occurrences. Authors with similar profiles of co-occurrences are displayed close together.

The decisive argument for ACA is that it enables one to see a literature-based counterpart of one’s own overview of a discipline.

As is well known, the closeness of author points on such maps is algorithmically related to their similarity as perceived by citers. We use Pearson r as a measure of similarity between author pairs, because it registers the likeness in shape of their co-citation count profiles over all other authors in the set.

The raw co-citation counts were converted to Pearson r correlation matrices by the FACTOR routine in SPSS, and factors were extracted by principal components analysis with varimax rotation. The default criterion of ‘‘eigenvalues greater than one’’ determined the number of factors extracted.
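A rough Python equivalent of this step, using scikit-learn's FactorAnalysis with varimax rotation as a stand-in for the SPSS principal-components routine (the counts are simulated, so the loadings are meaningless except as a demonstration; the 0.3 threshold follows the summary above):

import numpy as np
from sklearn.decomposition import FactorAnalysis

cocites = np.random.poisson(5.0, (120, 120))   # author x author co-citation counts
corr = np.corrcoef(cocites)                    # Pearson r profile similarities

fa = FactorAnalysis(n_components=12, rotation="varimax").fit(corr)
loadings = fa.components_.T                    # 120 authors x 12 factors
members = [np.where(loadings[:, f] > 0.3)[0] for f in range(12)]  # specialty membership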

The Pearson r correlation matrices for ALSCAL and CLUSTER in SPSS were generated with another SPSS routine, CORRELATIONS (cf. McCain, 1990). They were treated as nonmetric (ordinal) similarity data in ALSCAL and grouped by the complete linkage method in CLUSTER. Subdisciplinary groupings of the author points on the maps are based on the dendrograms from CLUSTER.
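A hedged sketch of the mapping and clustering in Python, with scikit-learn's nonmetric MDS standing in for ALSCAL and SciPy's complete-linkage routine standing in for CLUSTER (again on simulated similarities):

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

corr = np.corrcoef(np.random.poisson(5.0, (100, 100)))   # simulated author similarities
dissim = 1.0 - corr                                       # similarity -> dissimilarity
np.fill_diagonal(dissim, 0.0)

coords = MDS(n_components=2, metric=False, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)        # 2-D author map
tree = linkage(squareform(dissim, checks=False), method="complete")
two_groups = fcluster(tree, t=2, criterion="maxclust")    # the last two clusters joined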

Authors in the top 100 in all three periods—‘‘the canonical 75’’—were separately mapped with INDSCAL, a routine in the ALSCAL bundle that does a specialized kind of multidimensional scaling. The input data to INDSCAL are judgments on the similarity of a set of stimuli by a set of judges. INDSCAL reveals not only the judges’ composite view of the stimuli in multidimensional space, but the weight each individual judge gives each dimension; INDSCAL is short for ‘‘individual differences scaling.’’ We used the individual weights in a new way to explore the notion of ‘‘paradigm shift’’ as it affects the canonical 75.

The two-dimensional space in which the authors appear is relative, not absolute, and it fails to capture certain relationships among oeuvres that appear in higher dimensionality.

Specialties
The results of the factor analysis, incorporating 24 years' worth of data for the 120 authors, are presented in Table 3. ... Twelve factors were extracted; jointly (R²), they explain 84% of the variance. ... The first eight factors alone explain 78% of the variance. All have seven or more authors with loadings greater than 0.60 and may be interpreted as specialties within the discipline.

The two biggest specialties, obviously, are experimental retrieval, which focuses on the design and evaluation of document retrieval systems, and citation analysis, which focuses on the interconnectedness of scientific and scholarly literatures, usually with data from ISI.

The third biggest specialty we have labeled practical retrieval. Unlike the experimental retrievalists, the authors in this group, rather than working with content-neutral indexing theory, thought experiments, or document testbeds, have tended to discuss retrieval in terms of ‘‘real world’’ databases; terms such as ‘‘INSPEC’’ or ‘‘DIALOG’’ occasionally profane their pens.

The next specialty we call bibliometrics—a word often used to subsume the specialty we labeled citation analysis. However, unlike the citationists, the authors who load primarily here, including the pioneers Lotka, Bradford, and Zipf, are most interested in mathematically modeling certain regularities in textual or bibliographic statistical distributions, irrespective of the literatures from which they come.

General library systems theory is a not altogether satisfactory name for a body of writings on library automation, library operations research, library and information service policy, retrieval system evaluation, and many other interconnected topics.

The specialty we call user theory is appropriately headed by Dervin, author of a highly cited chapter on "information needs and uses" in the 1986 ARIST. ... It will be seen that authors who write about literatures—the citationists, bibliometricians, and scientific communication people—never load above 0.30 on this factor, apparently because citers do not perceive their work as having the right psychological content. On the other hand, quite a few retrievalists load above 0.30, and this suggests the nature of the cognition involved. It has to do with problem-solving at the interface where literatures are winnowed down for users: question formulation, search strategies, information-seeking styles, relevance judgments, and the like.

Authors loading mainly on scientific communication all have strong disciplinary identities outside L&IS—for example, in sociology. They may be thought of as explicating the social systems of science, including those in which formal publication of results is an important (but not the only important) part. The sociologists among them all have loadings, some quite high, in citation analysis, confirming their relevance to the study of scientific literatures.

The design of computerized library catalogs, especially for subject searching, is the province of authors who load on OPACs (online public access catalogs). It makes sense that leading authors here, such as Matthews, Hildreth, Cochrane, and Drabenstott, load secondarily in practical retrieval, just as several of the primary authors there, such as Borgman and Fidel, also turn up here.

As was said, the chief remaining factor seems to be a collection of authors in other disciplines from whom information science has imported ideas—e.g., cognitive science (Winograd), information theory (Shannon), computer science (Knuth)—that are all variously relevant to the central concern of information science, the human–computer–literature interface.

In fact, as both author cross-loadings and the maps below suggest, almost all of the factors or specialties in Table 3 can be aggregated upward into two larger subdisciplines: (1) The analytical study of learned literatures and their social contexts, comprising citation analysis and citation theory, bibliometrics, and communication in science and R&D; and (2) the study of the human–computer–literature interface, comprising experimental and practical retrieval, general library systems theory, user theory, OPACs, and indexing theory.

The Maps
Figures 2 through 4 are our 8-year period maps. We shall use them to explore the idea, introduced earlier, of two subdisciplines in information science. We operationalize this idea as the last two clusters joined in a complete-linkage clustering of 100 authors. These final clusters, which are brought together only after all closer ties have been exhausted, are separated by an angled line superimposed on each map.
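A complete-linkage cut at two clusters is straightforward to reproduce; the sketch below uses hypothetical 2-D author coordinates in place of the study's actual proximity data:

```python
# Sketch of the two-subdiscipline split: cut a complete-linkage
# tree over 100 authors at two clusters. Coordinates here are
# random stand-ins for the study's author proximity data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
coords = rng.normal(size=(100, 2))     # hypothetical map positions

# Complete linkage joins the two most dissimilar groups last,
# so a two-cluster cut recovers the final pair of clusters.
Z = linkage(pdist(coords), method="complete")
labels = fcluster(Z, t=2, criterion="maxclust")
print("subdiscipline sizes:", np.bincount(labels)[1:])
```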

We have not, as in the past, drawn lines around smaller clusters of authors corresponding to their specialties. The crowding of many names on the maps makes this difficult, and, besides, the specialties are better conveyed by the factor analysis of the earlier section. To a great extent, however, the authors forming specialties in the factor analysis will be found to have been placed near each other in the maps.

The first finding to note is the overall stability of information science, as here defined. Some author-points undergo remarkable changes of position from map to map, but many more authors stay put in discernible specialties. Fully 75, moreover, persist through all three maps.

We conclude that author co-citation analysis is useful for rendering the inertia of fields. In other words, it objectively captures the slow-changing divisions on which one's subjective sense of "semi-permanent" disciplinary structure rests.
Co-citation analysis of papers, as opposed to authors, captures disciplinary history at a different, faster rate, which may better suit fields with livelier research fronts than information science.

However, "domain analysis," as put forward by Hjørland and Albrechtsen (1995), seems a more appropriate choice. It incorporates citation analysis and bibliometrics, but also a range of topics broader than what "bibliometrics" usually implies—for example, scholarly and professional communication, parts of the sociology of science and sociology of knowledge, interdisciplinary linkages, discourse communities, and disciplinary vocabularies (cf. Beghtol, 1995).
ACA’s confirmation of expert judgments by Hjørland and Albrechtsen, Persson, and the Vickerys is consistent with the claim that citation databases can be exploited for non-experts in a form of AI.

The axes in INDSCAL maps are not subject to rotation and are supposed to be maximally interpretable. Thus prompted, we think the horizontal axis conveys, as in past studies, the range of subject specialties within the subdisciplines of domain analysis and information retrieval. ... Coherent groups, from left to right, include the citationists, the arc of bibliometricians across the top and the philosophically orienting figures across the bottom, "generalist" writers such as Smith, Wilson, Saracevic, and Swanson, and the hard and soft retrievalists. The plot generally makes good sense. For example, it is easy to accept Bookstein, Tague-Sutcliffe, Kantor, Buckland, Vickery, and Shaw as transitional figures between the retrievalists and the bibliometricians.
The more interesting vertical axis reflects another subject-related continuum. Information science deals, we said earlier, with "the human–computer–literature interface." If so, then the top pole represents a relative emphasis on literatures as objects of study, and the bottom, a relative emphasis on people or users. The same polarity can be inferred in earlier maps. Figure 4 showed that when a literature theoretician like Egghe enters, he is placed automatically at the top, whereas a user theoretician like Dervin is placed automatically at the bottom.

However, INDSCAL is expressly designed to reveal differences in the importance of each dimension to whoever is judging the similarity of stimuli. In our use of INDSCAL, the stimuli are the 75 authors, and the three periods are regarded as three separate "judges."
Usually, of course, persons are the judges in INDSCAL studies, and the "derived subject weights," which are standard INDSCAL output, are taken to show the salience of each dimension to each person. In replacing individuals as judges with large numbers of citers, we are acting as if the citers collectively embodied the paradigm of information science in each 8-year period.
Accordingly, we interpret the derived subject weights for each period as indicating the relative importance of the dimensions within the paradigm. Thus, we can probe a hidden aspect of disciplinary history—whether key dimensions of the field were given about the same weight in all periods. If not, that would be consistent with a perception of paradigm shift.
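The weight-fitting step can be sketched directly from the INDSCAL model, d_k(i, j)² = Σ_a w_ka (x_ia − x_ja)²: given a shared group space and one squared-distance matrix per period, each period's nonnegative dimension weights fall out of a nonnegative least-squares fit. Everything below (coordinates, distances, the "true" weights) is fabricated for illustration:

```python
# Sketch of INDSCAL's "derived subject weights", assuming the model
# d_k(i, j)^2 = sum_a w[k, a] * (x[i, a] - x[j, a])^2. Given a shared
# group space X and one squared-distance vector per period ("judge"),
# each period's nonnegative dimension weights come from an NNLS fit.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(2)
n_authors, n_dims, n_periods = 75, 2, 3
X = rng.normal(size=(n_authors, n_dims))          # shared group space

# Per-dimension squared coordinate gaps for every author pair.
i, j = np.triu_indices(n_authors, k=1)
preds = (X[i] - X[j]) ** 2                        # (n_pairs, n_dims)

# Hypothetical squared distances per period: stretch the space with
# known weights, add noise, then check that NNLS recovers them.
true_w = np.array([[1.0, 0.2],     # period 1: dimension 1 dominates
                   [0.6, 0.6],     # period 2: balanced
                   [0.3, 1.0]])    # period 3: dimension 2 dominates
for k in range(n_periods):
    d2 = preds @ true_w[k] + rng.normal(0.0, 0.05, size=len(preds))
    w, _ = nnls(preds, d2)
    print(f"period {k + 1} derived weights: {np.round(w, 2)}")
```

A shift in these weights across periods is exactly the signal the authors read as evidence of changing emphasis within the paradigm.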
Substantively, it is as if during 1972–1979 citers had regarded the range of specialties as by far the most important part of the information science paradigm, but then during 1980–1987 had taken much more cognizance of the differences in authors’ orientation toward literatures or users.

Perhaps the main weakness of this INDSCAL measure is that it is so indirect—that is, not clearly connected to specific papers with specific claims about the world. One expects evidence of paradigm shifts to leap from main texts, not references; from writers, not citers.
Though it might be used to discover paradigm shift, we think it has more promise as a means of confirming one. ... A shift detectable there implies not only that authors are promoting new lines of inquiry, but that citers are responding in such a way that the overall map of the discipline is changed.

Toward that account, ACA simultaneously provides both breadth and focus. It provides breadth by forcing contemplation of multiple specialties... It provides focus by forcing contemplation of particular authors, which is to say particular oeuvres and works. It also provides crude but unmistakable evidence of intellectual change.

"The role of information science is to explicate the conceptual and methodological foundations on which existing systems are based" (Borko, 1968, p. 67). Or "Information science is the study of the means by which organised structures (which we call 'information systems') process recorded symbols to meet their defined objectives" (Hayes, 1985, p. 174).
What they do study empirically, and uniquely, are problems associated with the human–literature barrier—the special difficulties of obtaining answers to questions from publications, in any medium, rather than persons. In other words, while many scholars seek to understand communication between persons, information scientists seek to understand communication between persons and certain valued surrogates for persons that literatures comprise (White, 1992).
This study requires a conceptual scheme that encompasses properties not only of literatures (e.g., size, growth rate, age, dispersion, authority levels, degree of summarization, quality of indexing) but also of people (e.g., interests and concerns, vocabularies, social ties, knowledge of existing systems, search styles, editorial strategies, resource environments).
The bond between domain analysts and retrievalists is their common interest in the literature barrier and related phenomena on both sides. The barrier in action is exemplified by information overload and underload—recurring topics for authors in both subdisciplines because they require both literatures and users to be discussed in a single framework, as implied by the second dimension of our maps.