Gowanlock, M., & Gazan, R. (2013). Assessing researcher interdisciplinarity: A case study of the University of Hawaii NASA Astrobiology Institute. Scientometrics, 94(1), 133-161.
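This study combines bibliometric techniques with a machine learning algorithm to assess the interdisciplinarity of research at the University of Hawaii NASA Astrobiology Institute (UHNAI). The unit of analysis is the UHNAI publication: each paper is represented by the aggregate of its own abstract and the abstracts of the references it cites. The UHNAI team produced 731 papers, which cite 10216 abstracts. In addition, 13 conflated subject categories (conflated SCs) were built from the distribution of the Journal Subject Categories of the cited journals. The main questions are: (1) whether the WoK journal subject categories are suitable for labelling astrobiology literature; (2) which actual and potential instances of interdisciplinary research in astrobiology can be identified using the conflated SCs; and (3) which actual and potential instances of interdisciplinary research, and which potential collaboration opportunities between researchers, can be identified using aggregated abstracts that represent the research tracks of the UHNAI team.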
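The study adopts the definitions used by Porter et al. (2007), which follow the National Academies' (2005) definitions of interdisciplinary research and multidisciplinary research. Interdisciplinary research integrates concepts, theories, techniques and/or data from two or more disciplines. Multidisciplinary research may adopt elements of other bodies of specialized knowledge, but without the interdisciplinary synthesis that leads to research greater than the sum of its parts.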
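According to van Leeuwen (2007), bibliometric approaches to research interdisciplinarity fall into two types. The first considers higher levels of publication aggregation, such as the research output of a country or university. It typically treats existing subject categories (SCs) as disciplinary boundaries and proceeds top-down; for example, van Raan and van Leeuwen (2002) and Porter et al. (2007) used SCs as a benchmark for measuring how far an author, journal or research area crosses scientific fields. The second is a bottom-up analysis of individual documents and the papers they cite: drawing on the authors' own words in titles, abstracts, keywords or full text, it describes the structure of a researcher, a journal or an entire field and suggests productive future directions. Because this study is concerned with future integration rather than past output, it adopts the bottom-up approach. Earlier work of this kind includes Kostoff et al. (2001), who applied phrase frequency and phrase clustering analysis to the text fields of citing and cited papers to understand research impact and interdisciplinarity, and Rafols and Meyer (2010), who combined top-down and bottom-up approaches to measure disciplinary diversity and knowledge integration.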
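The potential for collaboration across disciplinary boundaries is commonly examined by interviewing domain experts. This method is limited by sample size and subjectivity (Zhang et al., 2011), and analysing a topic such as astrobiology, which spans many disciplines, would require the knowledge of a polymath. Given these limitations, the study suggests that the analysis of interdisciplinarity in astrobiology should be guided by one or more people versed in the subject, whose expertise need not span all of its constituent disciplines. It therefore uses an unsupervised method, which needs no prior knowledge, to find trends in the data: sIB (sequential Information Bottleneck) text clustering, proposed by Slonim et al. (2002). Text clustering and classification can describe collaboration and knowledge integration, and uncovering the underlying structure of research tracks yields conclusions that are useful to astrobiology researchers.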
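The results show that the clusters produced by text mining the abstract data often do not correspond well to the SCs. The study therefore concludes that SCs are not well suited to classifying astrobiology publications, and speculates that the same holds for other interdisciplinary fields. One explanation is that astrobiology research cites both monodisciplinary and interdisciplinary publications, which may prevent SCs from forming cohesive clusters. Moreover, as discussed in Small (2010), many journals publish highly diverse content, which a classification system that works at the journal level cannot fully capture. Ten clusters gave the most suitable classification. With too few clusters, the disciplinary diversity of the source documents cannot be represented; with too many, the clusters may become too fragmented, reducing the chance of finding commonalities across disciplines and topics. The study suggests that when documents from different SCs cluster together, this may indicate an implicit interdisciplinary link, where knowledge from one field may inform another. Having researchers from the constituent disciplines evaluate these shared documents may provide a mechanism by which interdisciplinary science can occur, and a starting point for potential interdisciplinary collaboration. Text mining the aggregated abstracts with sIB is also suited to finding collaboration opportunities: papers by authors from the same academic department were more likely to cluster together, which supports the claim that the clusters produced by the method group similar papers. Authors whose papers fall in the same cluster may therefore be able to collaborate productively, while authors whose papers are spread across several clusters may be engaged in interdisciplinary research. The study also found that the papers of UHNAI researchers and postdoctoral fellows mostly appear in multiple clusters, making them the main group engaged in interdisciplinary research.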
In this study, we combine bibliometric techniques with a machine learning algorithm, the
sequential Information Bottleneck, to assess the interdisciplinarity of research produced by the
University of Hawaii NASA Astrobiology Institute (UHNAI).
In particular, we cluster abstract
data to evaluate Thomson Reuters Web of Knowledge subject categories as descriptive labels
for astrobiology documents, assess individual researcher interdisciplinarity, and determine where
collaboration opportunities might occur.
Following van Leeuwen (2007), we distinguish between a top-down bibliometric
approach, where large-scale trends at the highest levels of publication aggregation are considered
(such as the research output of a country or university), and a bottom-up approach, where
individual documents and the papers they cite are analyzed. We prefer the bottom-up approach.
A common method used to examine the potential of collaboration across disciplinary boundaries
is to interview domain experts, but this method suffers from several limitations, such as sample
size and subjectivity problems (Zhang et al., 2011). Furthermore, given that the subject matter of
astrobiology spans many disciplines, meaningful analysis of the responses would require the knowledge
of an astrobiology polymath.
After considering these limitations, we suggest that measuring
interdisciplinarity should be guided by one or more individuals versed in astrobiology, but whose
expertise need not span all of its constituent disciplines. Therefore, an unsupervised approach is
optimal as such methods can find trends in data without prior knowledge of its structure.
In this pilot study, we investigate the use of an unsupervised machine learning clustering technique,
the sequential Information Bottleneck (sIB) (Slonim et al., 2002) to aid in measuring researcher
interdisciplinarity.
Furthermore, we assess the extent to which Journal Subject Categories
from the Thomson Reuters Web of Knowledge database suite are sufficient for labelling astrobiology
documents.
The clustering and classification of text allow interdisciplinary analysis that 1)
describes collaboration and the integration of knowledge and 2) draws conclusions that are useful
to astrobiology researchers by uncovering the underlying structure of research tracks.
The multidisciplinary
context given by astrobiology affords an excellent opportunity to examine the methods used to study
researcher interdisciplinarity and knowledge integration.
Furthermore, we propose an iterative
process to identify specific publications that bridge diverse fields, to facilitate interdisciplinary
collaborations and ease the cognitive load of a single researcher who wishes to integrate knowledge
from multiple disciplines.
Research that occurs at the intersection between disciplines is thought to lead to great advances in
science (Porter and Rafols, 2009).
We
adopt the definition suggested by Porter et al. (2007), which followed the definition given by the
National Academies (2005): interdisciplinary research requires an integration of concepts, theories,
techniques and/or data from two or more bodies of specialized knowledge. Multidisciplinary
research may incorporate elements of other bodies of specialized knowledge, but without interdisciplinary
synthesis (Wagner et al., 2011) that leads to research that is greater than the sum of its
parts.
The usefulness of bibliometric indicators depends critically on
the level at which we wish to understand the integrative process. For example, funding agencies
may only require high-level publication co-authorship and collaboration statistics, describing the
research performed by their grantees and the diversity of their home disciplines, but not addressing
the essential aspect of synthesis.
Top-down approaches have been used to map scientific literature (for example, see Boyack et al.
(2005)), and often represent broad areas of science with Web of Knowledge (WoK) subject categories
(SCs). For example, van Raan and van Leeuwen (2002) and Porter et al. (2007) used SCs in
their methodology to measure interdisciplinarity. In these studies, SCs have been employed as de
facto disciplinary boundaries, and as a benchmark to measure how much a given author, journal
or research area crosses scientific fields.
Unfortunately, low-level conclusions that might inform
potentially productive individual collaborations cannot be made when relying on these top-down
approaches, as they focus on past outputs rather than future integration.
Conversely, bottom-up
bibliometric approaches incorporate the authors’ own words, in free-text fields such as titles,
abstracts, keywords and the full text of a document. Clustering bibliometric data at this level
can describe the structure of a researcher, journal or an entire field, and suggest productive future
directions.
A study by Rafols and Meyer (2010) combines bottom-up and top-down approaches
to measure both disciplinary diversity and knowledge integration.
While bibliometric studies tend to rely on a citation analysis, such an analysis may not be appropriate for every discipline or field. For example, a given field may tend
to reference conference proceedings, websites, newspapers, or colloquia which are not as conducive
to a co-citation analysis as journal articles. Due to this observation, Sugimoto (2011) suggests that
studying interdisciplinarity should include publications beyond journal articles.
One of the goals of this research is to uncover the underlying structure within
an astrobiology research team that undertakes interdisciplinary projects at the macro scale, but
may differ in the extent of interdisciplinary work at the micro level.
To understand the research
structure, we examine the abstract text of research publications and employ a method from the
field of information theory, the sIB method, to cluster our high dimensional abstract data.
An advantage of using WoK for bibliometric studies is that it provides a mapping of SCs to
each journal. Given the incommensurability of other bibliometric data (for example, journals do not
agree upon a common set of keywords), SCs provide a way to compare publications on the journal
level.
In Porter et al. (2007), the authors
examine the references in sets of journal articles gathered from WoK, and relate the journals to
their corresponding SCs. In this approach, each paper is represented by the set of SCs derived
from its references, and a more diverse set of SCs indicates a higher degree of interdisciplinarity
than a set of similar SCs.
In particular,
we combine all of the abstracts of all of the references cited by a UHNAI publication, and use these
aggregated abstracts to represent each publication.
Another text mining study (Kostoff et al.,
2001) employed free-text fields (such as titles, keywords and abstracts) of cited/citing publications
in combination with phrase frequency analysis and phrase clustering analysis to obtain a low-level
understanding of research impact and interdisciplinary research.
In the following subsections, we describe our methods used to achieve the following goals:
• Examine whether WoK SCs are sufficient for labelling astrobiology documents.
• Identify actual and potential instances of interdisciplinary research in astrobiology using conflated SCs (Section 3.3).
• Identify actual and potential instances of interdisciplinary research and identify potential
collaboration opportunities between researchers using aggregated abstracts to represent the
research tracks of the UHNAI team (Section 3.4).
We chose the sIB clustering method over others because it has been shown to perform better
than other unsupervised clustering methods, such as k-means (Slonim et al., 2002). Furthermore,
the approach should allow us to identify instances of interdisciplinary research by examining the
cluster membership of our abstract data without prior knowledge of the data’s properties. It
is necessary to use an unsupervised clustering method because a canonical set of astrobiology
documents with which to train a clustering technique does not exist.
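The sequential reassignment idea behind sIB can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used in the study: it assumes a document-by-word count matrix in which every row has at least one count, uses a single random start, and never empties a cluster. The names `sib`, `merge_cost`, `max_iter` and `seed` are hypothetical. Each document is drawn from its cluster and merged into the cluster with the lowest cost, where the cost is the combined probability mass multiplied by the weighted Jensen-Shannon divergence between the word distributions of the document and the cluster.

```python
import numpy as np

def _kl(p, q):
    """Kullback-Leibler divergence, summing only where p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def merge_cost(px, py_x, pt, py_t):
    """Cost of merging document x into cluster t:
    (p(x) + p(t)) times the weighted Jensen-Shannon divergence."""
    w = px + pt
    pi_x, pi_t = px / w, pt / w
    mix = pi_x * py_x + pi_t * py_t
    return w * (pi_x * _kl(py_x, mix) + pi_t * _kl(py_t, mix))

def sib(counts, n_clusters, max_iter=20, seed=0):
    """Cluster the rows of a document-by-word count matrix (simplified sIB)."""
    rng = np.random.default_rng(seed)
    joint = counts / counts.sum()            # p(x, y)
    px = joint.sum(axis=1)                   # p(x)
    n = joint.shape[0]
    # Random initial partition; every cluster is non-empty when n >= n_clusters.
    assign = rng.permutation(n) % n_clusters
    pt_y = np.zeros((n_clusters, joint.shape[1]))   # p(t, y)
    np.add.at(pt_y, assign, joint)
    sizes = np.bincount(assign, minlength=n_clusters)
    for _ in range(max_iter):
        changed = 0
        for x in rng.permutation(n):
            old = assign[x]
            if sizes[old] == 1:
                continue                     # simplification: keep clusters non-empty
            # Draw x out of its current cluster.
            pt_y[old] -= joint[x]
            np.clip(pt_y[old], 0.0, None, out=pt_y[old])  # guard against round-off
            sizes[old] -= 1
            pt = pt_y.sum(axis=1)
            costs = [merge_cost(px[x], joint[x] / px[x], pt[t], pt_y[t] / pt[t])
                     for t in range(n_clusters)]
            new = int(np.argmin(costs))
            # Merge x into the cheapest cluster (possibly the one it came from).
            pt_y[new] += joint[x]
            sizes[new] += 1
            assign[x] = new
            changed += int(new != old)
        if changed == 0:                     # no document moved: local optimum
            break
    return assign

# Toy corpus: three documents use words 0-1, three use words 2-3.
toy = np.array([[5, 4, 0, 0], [6, 3, 0, 0], [4, 5, 0, 0],
                [0, 0, 5, 5], [0, 0, 4, 6], [0, 0, 6, 4]], dtype=float)
print(sib(toy, n_clusters=2))
```

Because the procedure only reaches a local optimum, a practical run would normally repeat it from several random starts and keep the best partition; that step is omitted here for brevity.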
We modify the SCs using the following method:
• Journals with a single WoK SC that appears 10 or more times in our dataset use the assigned
WoK SC name.
• Journals with a single WoK SC that appears fewer than 10 times are reassigned to a broader
WoK category (e.g. “Biochemical Research Methods” becomes “Biochemistry & Molecular
Biology”).
• Journals with two or more SCs of roughly equivalent weight are assigned a new conflated SC
(e.g. “Astrophysics & Geophysics”).
• Journals with two or more SCs that have a clear primary SC have “-Multidisciplinary” appended
to the primary name.
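The four rules above can be written as a small function. This is a sketch under stated assumptions: the text does not say how "roughly equivalent weight" or a "clear primary SC" was decided, so here the analyst passes that judgment in through the hypothetical `primary` argument. The `broader` lookup table and the " & " joining convention are also assumptions made for illustration.

```python
from collections import Counter

MIN_COUNT = 10  # threshold stated in the rules above

def conflate_sc(journal_scs, sc_counts, broader, primary=None):
    """Return the conflated SC label for one journal.

    journal_scs: list of WoK SC names assigned to the journal
    sc_counts:   Counter of how often each SC appears in the dataset
    broader:     hand-made map from a rare SC to a broader WoK category (assumed)
    primary:     the clear primary SC if the analyst judges there is one, else None
    """
    if len(journal_scs) == 1:
        sc = journal_scs[0]
        if sc_counts[sc] >= MIN_COUNT:
            return sc                      # rule 1: keep the WoK name
        return broader.get(sc, sc)         # rule 2: fall back to a broader category
    if primary is not None:
        return f"{primary}-Multidisciplinary"   # rule 4: clear primary SC
    return " & ".join(journal_scs)         # rule 3: SCs of roughly equal weight

# Hypothetical counts and lookup table, for illustration only.
counts = Counter({"Oceanography": 25, "Biochemical Research Methods": 3})
broader = {"Biochemical Research Methods": "Biochemistry & Molecular Biology"}
print(conflate_sc(["Biochemical Research Methods"], counts, broader))
```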
The
dataset has 10216 abstracts integrated over 13 conflated SCs.
We use the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al.,
2002) to produce synthetic feature vectors, where a feature vector (or feature) is a normalized
numerical representation of the words that describe each abstract/instance.
We use
SMOTE to create synthetic feature vectors for the minority SCs such that each SC is represented
by the same number of features.
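The balancing step can be illustrated with a minimal NumPy sketch of the interpolation idea behind SMOTE: a synthetic vector is placed at a random point on the line between a minority-class vector and one of its nearest neighbours in the same class. The function names, the default `k=5`, and the choice to raise every class to the size of the largest class are assumptions for this illustration, not the settings used in the study. Each class needs at least two feature vectors.

```python
import numpy as np

def smote_oversample(X, n_new, k=5, rng=None):
    """Create n_new synthetic rows by interpolating within one class."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    k = min(k, n - 1)                       # cannot have more neighbours than peers
    # Pairwise squared Euclidean distances inside the class.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)   # which real rows to start from
    nn = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))            # position along the connecting segment
    return X[base] + gap * (X[nn] - X[base])

def balance_classes(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for label, count in zip(labels, counts):
        if count < target:
            X_parts.append(smote_oversample(X[y == label], target - count, k, rng))
            y_parts.append(np.full(target - count, label))
    return np.vstack(X_parts), np.concatenate(y_parts)

# Toy data: class 0 has four vectors, class 1 only two.
X_toy = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [5., 5.], [6., 5.]])
y_toy = np.array([0, 0, 0, 0, 1, 1])
X_bal, y_bal = balance_classes(X_toy, y_toy)
print(X_bal.shape, np.bincount(y_bal))
```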
The dataset contains 731 publications by the UHNAI team.
Each publication is represented
by its own abstract and the abstract of each cited publication. We aggregate all of these
abstracts in a single feature vector to represent each UHNAI publication. Non-journal publications
such as book chapters, conference proceedings and dissertations were included in the dataset,
although they constitute a very small fraction of the total publications.
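A publication's aggregated feature vector can be sketched as a normalized bag of words over its own abstract and the abstracts it cites. The tokenizer, the absence of stop-word removal, and normalization by relative frequency are assumptions made for this illustration; the text states only that the representation is normalized.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def aggregate_feature_vector(own_abstract, cited_abstracts):
    """Pool word counts from a paper's abstract and its cited abstracts,
    then normalize them to relative frequencies."""
    word_counts = Counter(tokenize(own_abstract))
    for abstract in cited_abstracts:
        word_counts.update(tokenize(abstract))
    total = sum(word_counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in word_counts.items()}

# Made-up abstracts, for illustration only.
vector = aggregate_feature_vector("Ice on Mars", ["Mars ice deposits", "Lunar ice"])
print(sorted(vector.items(), key=lambda item: -item[1])[:3])
```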
For the purposes of this
paper, where our goal is to identify actual and potential instances of interdisciplinary research in
astrobiology, a meaningful cluster relationship is one where papers from two or more SCs cluster
together, or when researchers from different fields have the aggregated abstracts of their papers
cluster together.
We begin by estimating the extent to which conflated Web of Knowledge Subject Categories accurately
describe the content of astrobiology publications.
However, when abstracts are assigned to one of five clusters
(Figure 4-top panel), we observe that the cluster membership for most SCs is heterogeneous: there
is no clear correspondence between a cluster and a single dominant SC. Even the most common SC,
Astronomy & Astrophysics, is primarily distributed across the first three clusters, but is represented
in all five.
Table 5 does suggest some areas in which SCs may be more appropriate document labels. For
example, Oceanography appears in only one cluster, and the Multidisciplinary Sciences SC is fairly
evenly distributed across four of the five. However, when increasing the number of clusters to 10,
15, and 20 (Figure 4), the heterogeneity of SCs within an individual cluster becomes even more
pronounced.
One would intuitively expect more SC heterogeneity
within each cluster; however, increasing the number of clusters also gives each SC more
opportunity to dominate a single cluster. When we increase the number of clusters to 10 (Figure 6, Table
7), we find that most of the SCs disperse into multiple clusters.
One way to interpret this result
is that more clusters allow finer distinctions between content to be revealed.
At the 10 cluster level, more clusters contain single dominant SCs
than at the 5 cluster level.
Figure 7 (Table 8) and Figure 8 (Table 9) present the results of clustering the abstracts into 15
and 20 clusters, respectively. We observe that many of the SCs are found distributed in multiple
clusters.
Therefore, at these clustering levels, we operationalize a dominant
SC within a cluster as one that either constitutes 50% or more of the abstracts alone, or one that
is within 50% of the size of the most common SC.
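One reading of this dominance rule is sketched below. It takes "within 50% of the size of the most common SC" to mean at least half as many abstracts as the most common SC, which is an interpretation rather than something the text spells out. Read this way, the most common SC in a cluster always qualifies. The function and parameter names are hypothetical.

```python
from collections import Counter

def dominant_scs(cluster_labels, share=0.5, ratio=0.5):
    """Return the SCs that dominate a cluster under the two criteria:
    at least `share` of all abstracts in the cluster, or
    at least `ratio` times the count of the most common SC."""
    counts = Counter(cluster_labels)
    if not counts:
        return []
    total = sum(counts.values())
    top = max(counts.values())
    return sorted(sc for sc, n in counts.items()
                  if n >= share * total or n >= ratio * top)

# Made-up cluster contents, for illustration only.
example_cluster = ["Astronomy"] * 6 + ["Oceanography"] * 3 + ["Physics"] * 1
print(dominant_scs(example_cluster))
```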
By this approximation, the results at the 10
cluster level hold: as a group, the Biochemistry and Biotechnology-related SCs dominate the fewest
clusters; the Astronomy, Oceanography and Physics group slightly more, and the Geochemistry and
Geophysics SCs are again the most diverse, short of the Multidisciplinary Sciences SC.
Overall, at
the 10 cluster level, more clusters contain single dominant SCs than at the 5, 15 or 20 cluster levels,
and the usefulness of SCs as document labels reaches a relative maximum.
In some cases, the trial processes reveal some inconsistencies in the cluster membership of
SCs.
Certain related SCs tend to consistently cluster together, which suggests that SCs are sufficient
for characterizing astrobiology publications. However, other SCs have a limited effectiveness as
document labels in this interdisciplinary domain, as some SCs did not map well to successively
smaller cluster sizes.
Therefore, our results suggest that WoK SCs may not consistently reflect the diverse content of astrobiology publications.
Across all three trials at the 10 cluster level in Figure 6, a single clearly dominant SC could be
identified in 27 of the 30 clusters. The Astronomy, Oceanography and Physics SCs demonstrated
somewhat less monodisciplinary dominance at the 10 cluster level; all had roughly 20% of their
abstracts assigned to other clusters. The Geochemistry & Geophysics and Environmental Sciences
SCs demonstrated the most diversity apart from the pure Multidisciplinary Sciences SC, though
somewhat surprisingly, the Geochemistry & Geophysics-Multidisciplinary SC appeared in fewer
clusters than its core SC.
Analyzing the heterogeneous cluster membership of publications from diverse SCs is one way
to assess interdisciplinary research possibilities, but the probabilistic nature of this method should
be emphasized. A heterogeneous cluster could indicate that SCs are poor document labels, or that
the clustering level should be adjusted to better match the data and metadata, or that a potential
interdisciplinary relationship exists. In any case, this process could inform targeted, iterative
investigation.
This result suggests that the sIB technique is able to cluster similar research at a high level; however, utilizing more
clusters should provide a lower-level view of overlap in research interests between the authors.
When running the sIB technique for 10 clusters, we begin to see where researchers may find
potential collaboration opportunities, and we observe which authors have specialized or broad research
interests. Research can be specialized but still integrate methods, techniques and data from
multiple disciplines. We believe that an author who is represented primarily in a single cluster
may not be engaging frequently in interdisciplinary research, or may be focusing on narrow research
problems, or using similar research methods or equipment. In Figure 10, we see that the
two astrochemists (Bennett and Kaiser) are entirely represented by cluster 8, consistent with the
results presented in Figure 9. We know that their research is heavily influenced by their experimental
apparatus, thus suggesting that the experimental methods and apparatus significantly affect
the description of a research track. Interestingly, Schörghofer’s research is on various planetary
bodies such as Mars and the Moon, which is also true of Taylor. Therefore, clustering the text of
the aggregated abstracts sufficiently illuminates similarities in research tracks across disciplinary
boundaries, in this case, between astronomy and geology.
In Figure 11, we observe that Huss, Jewitt, Krot and Meech’s research is found in many clusters.
This signifies that their research is likely to be very interdisciplinary. With regard to those authors
represented by a few clusters, we cannot conclude that their research is absolutely mono-disciplinary,
as it may be very specialized, or utilize the same methods or apparatus. However, we believe that
those UHNAI authors with publications in multiple clusters are more likely to be engaged in
interdisciplinary research. In Figure 12, we observe that of the senior (non-postdoctoral fellows)
astronomers (Reipurth, Meech, Jewitt, Haghighipour, Owen, Schörghofer), half (Meech, Jewitt,
and Owen) are fairly diverse in their research interests and the other half (Reipurth, Haghighipour,
Schörghofer) are engaged in specialized or mono-disciplinary research.
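The per-author reading of the cluster figures can be summarized with two simple numbers: how many clusters an author's papers fall into, and what share of them sits in the author's largest cluster. This is an illustrative sketch only; the summary statistics and the names `cluster_spread`, `n_clusters` and `top_share` are assumptions, not measures defined in the paper. As noted above, a low spread does not by itself establish that the research is mono-disciplinary.

```python
from collections import Counter, defaultdict

def cluster_spread(paper_assignments):
    """Summarize how each author's papers are spread over clusters.

    paper_assignments: iterable of (author, cluster_id) pairs, one per paper.
    Returns {author: {"n_clusters": int, "top_share": float}}.
    """
    per_author = defaultdict(Counter)
    for author, cluster in paper_assignments:
        per_author[author][cluster] += 1
    summary = {}
    for author, counts in per_author.items():
        total = sum(counts.values())
        summary[author] = {
            "n_clusters": len(counts),                  # distinct clusters used
            "top_share": max(counts.values()) / total,  # share in the largest cluster
        }
    return summary

# Placeholder authors and cluster ids, for illustration only.
pairs = [("A", 8), ("A", 8), ("A", 8), ("B", 1), ("B", 3), ("B", 3), ("B", 7)]
print(cluster_spread(pairs))
```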
These results suggest that the sIB method, in combination with aggregated abstracts, can illuminate
areas of implicit commonality where the research areas of scientists from diverse disciplines
overlap. Furthermore, while clusters do not inherently relate any information about a researcher’s
discipline, it is clear that researchers from the same department often cluster together. Therefore,
we expect that performing a similar analysis on the entire NASA Astrobiology Institute will show
where collaborations between researchers can occur, and can assist NASA with outlining research
priorities. These results can serve as the framework for a geospatial visualization of common yet
unconnected research tracks and potential collaborators, similar to the “hot regions” described by
Bornmann and Waltman (2011).