Thursday, November 30, 2017

Chen, S., Arsenault, C., Gingras, Y., & Larivière, V. (2015). Exploring the interdisciplinary evolution of a discipline: the case of Biochemistry and Molecular Biology. Scientometrics, 102(2), 1307-1323.

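This study examines the evolution of interdisciplinarity in Biochemistry and Molecular Biology (BMB) over one hundred years from several angles: changes in interdisciplinarity, the identification of core disciplines, the emergence of disciplines, and the detection of potential disciplines. It uses science maps and StreamGraphs as visualization tools for inspecting that evolution.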

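Porter and Rafols (2009) studied how interdisciplinarity evolved over thirty years in six research fields. Their results show notable changes over this period, particularly in the number of disciplines cited per article, the number of references, and the number of co-authors per article. The Rao-Stirling index, however, rose only slightly, which Porter and Rafols attribute to citations still falling mainly within neighboring disciplines. Larivière and Gingras (2014) found that interdisciplinary research declined between 1945 and 1975 and has risen gradually since. Levitt et al. (2011), studying the interdisciplinarity of the subject categories of the SSCI (Social Science Citation Index), found that it declined between 1980 and 1990 and had returned to its 1980 level by 2000.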

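In bibliometrics, interdisciplinarity is usually measured through citation relationships that cross disciplines. For example, Porter and Chubin (1985) measured the degree of interdisciplinarity of articles with the Citations Outside Category (COC) indicator, defined as the percentage of citations or references that come from a different discipline. This approach has been applied in several studies: Rinia et al. (2002a) measured the interdisciplinarity of all fields of science, and Morillo et al. (2001) measured that of chemistry. Adams et al. (2007) used, in addition to the share of cited references from other disciplines, the number of disciplines cited and the Shannon Diversity Index. Carley and Porter (2012) used the Rao-Stirling diversity index to explore knowledge integration through cited references. Levitt and Thelwall (2008) instead used the WoS and Scopus subject classifications, measuring interdisciplinarity by the multiple category assignments of the journals in which papers were published.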
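As a rough illustration of two of the indicators mentioned above, the sketch below computes a COC share and a Shannon diversity index from a list of cited categories. The function names and the sample reference list are hypothetical, and the use of the natural logarithm is an assumption; the cited studies may differ in such details.

```python
import math
from collections import Counter

def coc_share(cited_categories, home_category):
    """Fraction of references whose category differs from the citing paper's own."""
    if not cited_categories:
        return 0.0
    outside = sum(1 for c in cited_categories if c != home_category)
    return outside / len(cited_categories)

def shannon_index(cited_categories):
    """Shannon diversity H = -sum(p_i * ln p_i) over the cited categories."""
    counts = Counter(cited_categories)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# Hypothetical reference list of one BMB article (illustrative data only).
refs = ["BMB"] * 6 + ["Chemistry"] * 3 + ["Physics"]
print(coc_share(refs, "BMB"))   # share of references outside BMB
print(shannon_index(refs))      # larger when references are spread more evenly
```

Both numbers depend entirely on the classification scheme used to label the references, which is why the studies above are hard to compare directly.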

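Rinia et al. (2002b, 244) defined interdisciplinarity as the percentage of articles a group of researchers publishes outside their "main" discipline. Qiu (1992) used authors' organizational affiliations to investigate interdisciplinary collaboration. Abramo et al. (2012), in a case study of Italy, assigned categories by researchers' disciplines and identified interdisciplinarity through collaboration between co-authors from different disciplines. Le Pair (1980) and Sugimoto et al. (2011) explored interdisciplinarity through the discipline of authors' highest degree. Le Pair (1980) regarded interdisciplinary collaboration as the migration of scientists from one discipline to another over the course of their careers. Sugimoto et al. (2011) used 80 years (1930-2009) of doctoral dissertations to describe, through academic genealogy, how the degree of interdisciplinarity in library and information science changed.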

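This study uses all articles published since 1910 under the WoS subject category BMB, 1,539,526 in total, together with those of their references whose document type is journal article, 40,855,852 in total. The journals of the references are assigned to disciplines using the subject classification of the U.S. National Science Foundation (NSF), which has 143 categories, with each journal assigned to one primary category. The study considers only categories with more than 250 references, treats the categories that account for a larger share of references as core disciplines, and also forecasts disciplines with potential. COC is analyzed with the Rao-Stirling index (Rafols and Meyer, 2010), which accounts for the number of disciplines and the concentration of the distribution of cited references as well as the similarity between disciplines. The study uses the relative openness indicator (Lee et al., 2009) to gauge the influence of cited disciplines, and explores the evolution of interdisciplinarity by tracking changes in that influence.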
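A minimal sketch of a Rao-Stirling computation may help make the index concrete. It assumes the common form that sums p_i * p_j * d_ij over all ordered pairs of distinct categories, with d_ij a distance between 0 and 1; some authors sum over unordered pairs instead, which halves the value. The shares and distances below are made up for illustration and are not taken from the paper.

```python
from itertools import permutations

def rao_stirling(proportions, distance):
    """Sum of p_i * p_j * d_ij over all ordered pairs of distinct categories."""
    return sum(
        proportions[a] * proportions[b] * distance[frozenset((a, b))]
        for a, b in permutations(proportions, 2)
    )

# Hypothetical shares of one year's references and made-up distances in [0, 1].
shares = {"BMB": 0.5, "Chemistry": 0.3, "Physics": 0.2}
dist = {
    frozenset(("BMB", "Chemistry")): 0.2,
    frozenset(("BMB", "Physics")): 0.8,
    frozenset(("Chemistry", "Physics")): 0.5,
}
print(rao_stirling(shares, dist))
```

The index grows when references move toward categories that are both numerous and far apart, so a small share of citations to a distant discipline can matter more than a large share to a neighboring one.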

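For visualization, the study uses StreamGraphs to show long-term change in the core disciplines; changes in the width of a stream indicate the rise or fall of a discipline's influence. Visualizing disciplines in the chronological order of their appearance is another way to illustrate BMB's interdisciplinarity; the study selects disciplines cited 250 times or more to investigate BMB's emerging disciplines. In addition, the study generates science maps with VOSviewer. Each unit on the map is one of the 143 disciplines of the NSF subject classification; distances and sizes are computed from the discipline co-citation matrix for a ten-year period (2003-2012), and units are colored according to the 14 NSF major discipline groups.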




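The results show that over the 100 years the number of disciplines BMB cites per year grew from 1 to 93, spanning 12 major discipline groups; only two, Humanities and Arts, were never cited. The more important major groups are Clinical Medicine, Biomedical Research, and Biology; Physics and Chemistry are also quite important. The growth in the number of cited disciplines shows that BMB became increasingly interdisciplinary over the 100 years.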


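Comparing the share of BMB's citations that each discipline receives (ri) with that discipline's share of all published papers (pi), the study identifies 15 core disciplines.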



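Nearly all core disciplines were first cited before about 1950; the core disciplines that appeared later are mostly emerging disciplines. On the science map, core disciplines are usually placed near BMB; in other words, disciplines close to BMB were cited by BMB early on.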

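Although BMB's citations go mainly to BMB itself (45.32%), that share is clearly declining: it was 74.0% in 1910 but only 32.1% in 2012. Apart from BMB itself, General Biomedical Research is the next most cited discipline. The important interdisciplinary partners nevertheless differ from period to period. In the early period, Physiology, Pharmacology, and Immunology were the most important disciplines; more recently their importance has declined, and they have been replaced by disciplines such as Cellular Biology, Cytology and Histology, Genetics and Heredity, and Cancer, along with General Biomedical Research.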










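In the earliest period, BMB cited mainly papers from its own discipline, followed by Chemistry (1924), Clinical Medicine (1937), and Biology. These three disciplines are also the ones closest to BMB on the science map formed by the whole NSF classification system. As BMB developed, more distant disciplines gradually joined. Based on the order of BMB's interdisciplinary development, four periods can be distinguished. The first spans the 50 years from 1910 to 1960, with 18 disciplines from biomedical research, clinical medicine, chemistry, and biology; most core disciplines had joined by the end of this period. The second spans the 20 years from 1961 to 1981 and added 34 disciplines; besides a large increase in citations to biomedical research and clinical medicine, references from physics notably increased, showing that BMB's interdisciplinary reach began to expand outward from its neighboring disciplines. The third spans the 20 years from 1982 to 2002 and added 27 disciplines, extending the reach further to Engineering and Technology, psychology, earth and space, and mathematics. The fourth begins in 2003, with 16 new disciplines; notable among them is Library and Information Science, which joined because studies in this period began to evaluate BMB research with bibliometric methods. The science maps of the four periods also show that the disciplines BMB cites have become increasingly diverse: those cited at the start lie close to BMB on the map, while later periods draw more and more on distant disciplines.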




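The following are the five fastest-growing disciplines of the most recent ten years; they can be regarded as disciplines with interdisciplinary potential for BMB.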




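Over the 100 years, the Rao-Stirling index rose from about 0.3 to about 0.6, roughly doubling. In Porter and Rafols (2009), the Rao-Stirling index of the six fields studied showed no marked increase over 1975-2005, whereas BMB's index rose by about 0.32 times over the same interval.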

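This study confirms that interdisciplinarity evolves mainly from adjacent fields toward cognitively more distant ones, and that BMB researchers increasingly cite the literature of other disciplines. Although citations to more distant fields make up a small share, that share is growing markedly, so these fields are the ones with greater interdisciplinary potential for BMB.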



Tuesday, November 21, 2017

Zhang, L., Rousseau, R., & Glänzel, W. (2016). Diversity of references as an indicator of the interdisciplinarity of journals: Taking similarity between subject fields into account. Journal of the Association for Information Science and Technology, 67(5), 1257-1265.

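This study measures the interdisciplinarity of journals by the diversity of the subject fields of the items cited in articles' reference lists. Subject fields and subfields follow the Leuven-Budapest (ECOOM) subject-classification scheme, and diversity is measured with a Hill-type true diversity that takes variety, balance, and disparity into account. After measuring the interdisciplinarity of journals from several disciplines, the study also examines whether highly interdisciplinary journals have greater visibility and impact.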

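Studies of diversity generally treat the evenness of the distribution and the number of distinct types as its two main dimensions. Stirling (2007) proposed adding disparity, that is, differences in network structure, to make the concept of diversity more precise, which led to the Rao-Stirling diversity measure. However, in the studies by Leydesdorff and Rafols (2011) and Zhou, Rousseau, Yang, Yue, and Yang (2012), the Rao-Stirling measure did not perform better than the previously used Shannon entropy or Gini coefficient.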

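According to the definition of interdisciplinarity given by the National Academies of the USA, interdisciplinary research is a mode of research by teams or individuals that integrates information, data, techniques, tools, perspectives, concepts, and/or theories from two or more disciplines or bodies of specialized knowledge to advance fundamental understanding or to solve problems that lie beyond the scope of a single discipline. Building on this definition, Rafols and Meyer (2010) provided a framework for studying interdisciplinarity that combines the idea of diversity with network coherence and produces corresponding visualizations.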

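Jost (2009) notes that a true diversity must satisfy six requirements:
1. Symmetry.
2. Zero output independence: adding a type with value 0 does not change the diversity value.
3. Transfer principle: transferring abundance from a rarer type to a more common one lowers diversity.
4. Homogeneity: diversity depends only on the relative frequencies of the types, not on their absolute abundances.
5. Replication principle: suppose m groups have the same set of species abundances but share no species; all m groups necessarily have the same diversity D0, and when the m groups are pooled the diversity of the whole is mD0.
6. Normalization: when the diversity measure is applied to N equally common types, its value is N.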


Only when these six requirements are met are ratios of diversities meaningful.

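This study adopts the measure proposed by Leinster & Cobbold (2012), which satisfies the following requirements:
1. The diversity of N equally abundant, completely dissimilar species is N.
2. Suppose a community is divided into m subgroups that share no species, with species in different subgroups completely dissimilar. Then the diversity of the community depends only on the sizes and diversities of the subgroups.
3. If, in addition, these m subgroups are of equal size and have equal diversity D0, the diversity of the whole community is m·D0.
4. Diversity does not change with the order in which the species are listed.
5. Adding a new species with abundance 0 leaves diversity unchanged.
6. If two species are identical, merging them leaves diversity unchanged.
7. As similarity between species increases, diversity decreases.
8. Diversity is larger when similarity between species is ignored.
9. The diversity of N species lies between 1 and N.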
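Several of the properties listed above can be checked numerically. The sketch below implements the similarity-sensitive diversity of order q in the form usually attributed to Leinster and Cobbold, D = (sum_i p_i * (Zp)_i^(q-1))^(1/(1-q)), where Z is a similarity matrix with ones on the diagonal. The function name, the default q = 2, and the sample data are assumptions made for illustration, not the paper's own code.

```python
import math

def similarity_sensitive_diversity(p, Z, q=2.0):
    """Diversity of order q for relative abundances p and similarity matrix Z."""
    n = len(p)
    # (Zp)_i: the "ordinariness" of category i, its abundance plus that of similar ones.
    zp = [sum(Z[i][j] * p[j] for j in range(n)) for i in range(n)]
    support = [i for i in range(n) if p[i] > 0]
    if abs(q - 1.0) < 1e-12:
        # Limit q -> 1, the similarity-sensitive analogue of exp(Shannon entropy).
        return math.exp(-sum(p[i] * math.log(zp[i]) for i in support))
    total = sum(p[i] * zp[i] ** (q - 1.0) for i in support)
    return total ** (1.0 / (1.0 - q))

identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
uniform = [0.25] * 4
# Four equally abundant, completely dissimilar categories (property 1).
print(similarity_sensitive_diversity(uniform, identity))
```

With q = 2 and similarities defined as 1 minus distance, this quantity equals 1 / (1 - RS), where RS is the Rao-Stirling sum over ordered pairs, which shows how the Hill-type number rescales the Rao-Stirling measure.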

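Interdisciplinarity was measured for the articles of seven journals, using ECOOM's 16 major subject fields and 68 subfields as the subject system for evaluating diversity. The results differ considerably depending on whether major fields or subfields serve as the categories: "Journal of the American Geriatrics Society" ranks second when measured with ECOOM subfields but fifth with major fields; by contrast, "Scientometrics" rises sharply in rank (from fourth to first) when the subject system shifts from subfields to major fields. This indicates that at the local, subfield level Journal of the American Geriatrics Society is the more diverse, whereas the references of Scientometrics are spread more widely across the major fields at the global level. "Bioinformatics" has relatively high diversity at both the local and global levels. The two mathematics journals, as expected, score low on diversity at both levels. Comparing Nature and Science, the two journals generally regarded as multidisciplinary, the former scores higher on diversity at both levels.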

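Larivière and Gingras (2010) used, as an indicator of interdisciplinarity, the share of references in papers published in 2000 that go to WoS categories other than that of the paper's own journal. Overall, they found no correlation between papers' interdisciplinarity and the number of citations they receive, though disciplines differ. For most disciplines there is an optimal level of interdisciplinarity: neither the least nor the most interdisciplinary articles are cited most. The present study carries out this analysis with the diversity measure proposed by Leinster & Cobbold (2012), comparing it with the number of citations received within three years of publication. It finds that the relationship between citations and diversity differs greatly from journal to journal. In the two multidisciplinary journals, Nature and Science, the mean number of citations rises with diversity and peaks in the 4-6 range; Journal of the American Geriatrics Society and Scientometrics show similar citation behavior. Bioinformatics differs from these journals: its mean number of citations tends to fall as diversity increases. The two mathematics journals tend to receive more citations as diversity increases. For abstract-theoretical disciplines, greater diversity seems to imply broader applicability, which in turn leads to more citations.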


The objective of this article is to further the study of journal interdisciplinarity, or, more generally, knowledge integration at the level of individual articles.

Interdisciplinarity is operationalized by the diversity of subject fields assigned to cited items in the article’s reference list.

Subject fields and subfields were obtained from the Leuven-Budapest (ECOOM) subject-classification scheme, while disciplinary diversity was measured taking variety, balance, and disparity into account.

As diversity measure we use a Hill-type true diversity in the sense of Jost and Leinster-Cobbold.

Zhang, Glänzel, and Liang (2009, 2010) applied the entropy indicator to measure how far cross-citation links are spread among other journals, and compared the result with “centrality” measures. The authors found a clear divergence between strongly interlinked and high-entropy journals.

Rafols and Meyer (2010) provided a framework for the study of interdisciplinarity, where interdisciplinarity is understood as knowledge integration.

Most important, following Jost (2006, 2007, 2009) and Leinster and Cobbold (2012), we oppose the use of measures such as the Shannon entropy and the Rao-Stirling measure, and use their Hill-type numbers.

In order to be able to do so, we interpret interdisciplinarity as a kind of diversity and will first shed light on the mathematical background of measuring several aspects that are usually associated with diversity, namely, variety, balance, and disparity.

Stirling (2007) and Leinster and Cobbold (2012) pointed out that the notion of diversity has three components: variety, balance, and disparity. Each of them, considered separately, is necessary but not yet sufficient to measure diversity in an adequate manner. Neglecting one of these three aspects may distort the final assessment of diversity.

Variety is the number of nonempty categories to which system elements are assigned. In particular, it is the answer to the question: How many types of things do we have? In information science it may be the answer to the question: In how many different journals has this author published? Assuming that all things are equal, the greater the variety, the greater the diversity.

Balance is a function of the pattern of the assignment of elements across categories. It is the answer to the question: What is the relative number of items of each type? Balance is also called evenness (in ecology) and concentration (in economics). Evenness can be represented by the Lorenz curve (Nijssen et al., 1998). The Gini index is a well-known concentration or evenness measure (actually if G denotes the Gini concentration measure, then g = 1-G is the corresponding measure of evenness). In information science one may consider, for instance, how many articles an author has published in each journal. All else being equal, the more balanced the distribution, the larger the diversity.

Disparity refers to the manner and the degree in which things may be distinguished. It is the answer to the question: How different from each other are the types of things that we observe? For instance, publishing only in library and information science (LIS) journals shows less disparity than publishing in LIS and management and economics journals. All else being equal, the higher the disparity, the greater the diversity.

Mathematically speaking, variety is a positive, natural number as categories are numbered in sequence; balance is a function of fractions summing up to one, and disparity is a function of a matrix of distances (or similarities).

The problem now is how to find a single index that can aggregate properties of variety, balance, and disparity in a meaningful way and without much loss of information.

The interdisciplinarity of a publication is operationalized by the diversity of subject classifications over the publication’s references. Measures of diversity are calculated for each publication by classifying its references into one or more disciplines.

• Variety corresponds to the number of subject fields to which references of an individual paper can be assigned. A publication will have a high variety if its references are assigned to many different subject fields.

• Balance describes the evenness of the distribution of the subject field classifications. A publication will have high balance if the proportion of references is evenly distributed across categories (e.g., three for cell biology, three for physiology, and three for microbiology) and low balance if they are unevenly distributed (10 for cell biology, one for physiology, and one for microbiology).

• Disparity is taken into account by the distance between subject fields the references have been assigned to.
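The variety and balance bullets can be made concrete with the reference counts used in the balance example. This is a minimal sketch: the helper names are hypothetical, and the Gini concentration is computed with the mean-absolute-difference formula, which is one of several normalizations in use.

```python
def variety(counts):
    """Number of nonempty categories."""
    return sum(1 for n in counts if n > 0)

def gini_evenness(counts):
    """Evenness g = 1 - G, where G is the Gini concentration of the nonzero counts."""
    xs = [n for n in counts if n > 0]
    if not xs:
        return 0.0
    diff_sum = sum(abs(a - b) for a in xs for b in xs)
    gini = diff_sum / (2 * len(xs) * sum(xs))
    return 1.0 - gini

# Reference counts for cell biology, physiology, and microbiology.
balanced = [3, 3, 3]
skewed = [10, 1, 1]
print(variety(balanced), gini_evenness(balanced))
print(variety(skewed), gini_evenness(skewed))
```

Both papers have the same variety, so only the balance term separates them; disparity would additionally require a distance or similarity between the three fields.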

It seems that for an abstract-theoretical field, such as mathematics and logic, with generally low diversity, an increase in diversity points to a broader applicability and hence an increase in citations. 

Thursday, November 16, 2017

Gowanlock, M., & Gazan, R. (2013). Assessing researcher interdisciplinarity: A case study of the University of Hawaii NASA Astrobiology Institute. Scientometrics, 94(1), 133-161.

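This study combines bibliometric techniques with a machine learning algorithm to assess the interdisciplinarity of research at the University of Hawaii NASA Astrobiology Institute (UHNAI). The unit of analysis is the papers published by UHNAI; each paper is represented by its own abstract aggregated with the abstracts of the references it cites. The UHNAI team has 731 papers in total, and the abstracts of the publications they cite number 10216. In addition, 13 conflated subject categories (conflated SCs) were built from the distribution of Journal Subject Categories of the cited journals. The main questions are: (1) whether WoK Journal Subject Categories are suitable for labelling the astrobiology literature; (2) identifying actual and potential instances of interdisciplinary research in astrobiology using the conflated SCs; (3) identifying actual and potential instances of interdisciplinary research in astrobiology, and potential collaboration opportunities between researchers, using the aggregated abstracts that represent the research tracks of the UHNAI team.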

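This study adopts the definitions of interdisciplinary research and multidisciplinary research that Porter et al. (2007) based on those given by the National Academies in 2005: the former integrates concepts, theories, techniques, and/or data from two or more disciplines; the latter only incorporates elements from other bodies of specialized knowledge, without the interdisciplinary synthesis that leads to research greater than the sum of its parts.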

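Following van Leeuwen (2007), bibliometric approaches to research interdisciplinarity fall into two kinds. The first considers higher levels of publication aggregation, such as the research output of a country or university, and usually takes ready-made subject categories (SCs) as disciplinary boundaries for a top-down analysis; for example, van Raan and van Leeuwen (2002) and Porter et al. (2007) used SCs as a benchmark for measuring how far an author, journal, or research area crosses scientific fields. The second is a bottom-up analysis of individual documents and the papers they cite, which uses the authors' own words in titles, abstracts, keywords, or full text to describe the structure of a researcher, a journal, or an entire field and to suggest productive future directions. Because this study is concerned with future integration rather than past output, it takes the second approach. Earlier work of this kind includes Kostoff et al. (2001), who applied phrase frequency and clustering analysis to the text fields of citing and cited papers to understand research impact and interdisciplinarity, and Rafols and Meyer (2010), who combined top-down and bottom-up approaches to measure disciplinary diversity and knowledge integration.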

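When analyzing the potential for collaboration across disciplinary boundaries, a common method is to interview domain experts. This method is limited by sample size and subjectivity (Zhang et al., 2011), and analyzing a subject such as astrobiology, which spans many disciplines, would require the knowledge of a very widely read expert. Given these limitations, the study suggests that the analysis of interdisciplinarity in astrobiology should be guided by one or more people versed in the subject, whose expertise need not cover all of its constituent disciplines. The study therefore adopts an unsupervised method, which can find trends in data without prior knowledge: the sIB (sequential Information Bottleneck) text clustering method proposed by Slonim et al. (2002). Clustering and classifying text can describe phenomena such as collaboration and knowledge integration, and uncovering the underlying structure of research tracks can yield conclusions useful to astrobiology researchers.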

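The results show that the clusters produced by text mining the abstract data often do not agree well with the SCs. The study therefore concludes that SCs are not well suited to classifying astrobiology publications, and conjectures that the same holds for other interdisciplinary fields. One explanation is that astrobiology research cites both monodisciplinary and interdisciplinary publications, which may prevent SCs from forming cohesive clusters. Moreover, as discussed in Small (2010), many journals publish highly diverse content that a classification system operating at the journal level cannot fully represent. Ten clusters gave the most suitable classification: with too few clusters, the disciplinary diversity of the source documents cannot be represented; with too many, the clusters may be too fragmented, reducing the chance of finding commonalities among different disciplines and topics. The study suggests that when documents from different SCs cluster together, this may indicate an implicit interdisciplinary connection, in which knowledge from one field may inform another. Having researchers from the constituent disciplines evaluate these shared documents may provide a mechanism by which interdisciplinary science takes place and a starting point for potential interdisciplinary collaboration. Text mining the aggregated abstracts with sIB is also suited to finding collaboration opportunities. The study finds that papers by authors from the same academic department are more likely to cluster together, which supports the claim that the method gathers similar papers within a cluster. Authors whose papers fall in the same cluster may therefore be able to collaborate productively, while authors whose papers are spread over several clusters may be engaged in interdisciplinary research. The study also finds that the papers of UHNAI researchers and postdoctoral fellows mostly appear in multiple clusters, making them the main group doing interdisciplinary research.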

In this study, we combine bibliometric techniques with a machine learning algorithm, the sequential Information Bottleneck, to assess the interdisciplinarity of research produced by the University of Hawaii NASA Astrobiology Institute (UHNAI).

In particular, we cluster abstract data to evaluate Thomson Reuters Web of Knowledge subject categories as descriptive labels for astrobiology documents, assess individual researcher interdisciplinarity, and determine where collaboration opportunities might occur.

Following van Leeuwen (2007), we distinguish between a top-down bibliometric approach, where large-scale trends at the highest levels of publication aggregation are considered (such as the research output of a country or university), and prefer a bottom-up approach, where we analyze individual documents and the papers they cite.

A common method used to examine the potential of collaboration across disciplinary boundaries is to interview domain experts, but this method suffers from several limitations, such as sample size and subjectivity problems (Zhang et al., 2011). Furthermore, given that the subject matter of astrobiology spans many disciplines, meaningful analysis of the responses would require the knowledge of an astrobiology polymath.

After considering these limitations, we suggest that measuring interdisciplinarity should be guided by one or more individuals versed in astrobiology, but whose expertise need not span all of its constituent disciplines. Therefore, an unsupervised approach is optimal as such methods can find trends in data without prior knowledge of its structure.

In this pilot study, we investigate the use of an unsupervised machine learning clustering technique, the sequential Information Bottleneck (sIB) (Slonim et al., 2002) to aid in measuring researcher interdisciplinarity.

Furthermore, we assess the extent to which Journal Subject Categories from the Thomson Reuters Web of Knowledge database suite are sufficient for labelling astrobiology documents.

The clustering and classification of text allow interdisciplinary analysis that 1) describes collaboration and the integration of knowledge and 2) draws conclusions that are useful to astrobiology researchers by uncovering the underlying structure of research tracks.

The multidisciplinary context given by astrobiology affords an excellent opportunity to examine the methods used to study researcher interdisciplinarity and knowledge integration.

Furthermore, we propose an iterative process to identify specific publications that bridge diverse fields, to facilitate interdisciplinary collaborations and ease the cognitive load of a single researcher who wishes to integrate knowledge from multiple disciplines.

Research that occurs at the intersection between disciplines is thought to lead to great advances in science (Porter and Rafols, 2009).

We adopt the definition suggested by Porter et al. (2007), which followed the definition given by the National Academies (2005): interdisciplinary research requires an integration of concepts, theories, techniques and/or data from two or more bodies of specialized knowledge. Multidisciplinary research may incorporate elements of other bodies of specialized knowledge, but without interdisciplinary synthesis (Wagner et al., 2011) that leads to research that is greater than the sum of its parts.

The usefulness of bibliometric indicators depends critically on the level at which we wish to understand the integrative process. For example, funding agencies may only require high-level publication co-authorship and collaboration statistics, describing the research performed by their grantees and the diversity of their home disciplines, but not addressing the essential aspect of synthesis.

Top-down approaches have been used to map scientific literature (for example, see Boyack et al. (2005)), and often represent broad areas of science with Web of Knowledge (WoK) subject categories (SCs). For example, van Raan and van Leeuwen (2002) and Porter et al. (2007) used SCs in their methodology to measure interdisciplinarity. In these studies, SCs have been employed as de facto disciplinary boundaries, and as a benchmark to measure how much a given author, journal or research area crosses scientific fields.

Unfortunately, low-level conclusions that might inform potentially productive individual collaborations cannot be made when relying on these top-down approaches, as they focus on past outputs rather than future integration.

Conversely, bottom-up bibliometric approaches incorporate the authors’ own words, in free-text fields such as: titles, abstracts, keywords and the full text of a document. Clustering bibliometric data at this level can describe the structure of a researcher, journal or an entire field, and suggest productive future directions.

A study by Rafols and Meyer (2010) combines bottom-up and top-down approaches to measure both disciplinary diversity and knowledge integration.

While bibliometric studies tend to rely on a citation analysis, such an analysis may not be appropriate for every discipline or field. For example, a given field may tend to reference conference proceedings, websites, newspapers, or colloquia which are not as conducive to a co-citation analysis as journal articles. Due to this observation, Sugimoto (2011) suggests that studying interdisciplinarity should include publications beyond journal articles.

One of the goals of this research is to uncover the underlying structure within an astrobiology research team that undertakes interdisciplinary projects at the macro scale, but may differ in the extent of interdisciplinary work at the micro level.

To understand the research structure, we examine the abstract text of research publications and employ a method from the field of information theory, the sIB method, to cluster our high dimensional abstract data.

An advantage of using WoK for bibliometric studies is that it provides a mapping of SCs to each journal. Given the incommensurability of other bibliometric data (for example, journals do not agree upon a common set of keywords), SCs provide a way to compare publications on the journal level.

In Porter et al. (2007), the authors examine the references in sets of journal articles gathered from WoK, and relate the journals to their corresponding SCs. In this approach, a more diverse set of SCs that represent a paper derived from its references indicates a higher degree of interdisciplinarity than a set of similar SCs that represent a paper.

In particular, we combine all of the abstracts of all of the references cited by a UHNAI publication, and use these aggregated abstracts to represent each publication.

In another text mining study, Kostoff et al. (2001) employed free-text fields (such as title, keywords and abstracts) of cited/citing publications in combination with phrase frequency analysis and phrase clustering analysis to obtain a low-level understanding of research impact and interdisciplinary research.

In the following subsections, we describe our methods used to achieve the following goals:
• Examine whether WoK SCs are sufficient for labelling astrobiology documents.
• Identify actual and potential instances of interdisciplinary research in astrobiology using conflated SCs (Section 3.3).
• Identify actual and potential instances of interdisciplinary research and identify potential collaboration opportunities between researchers using aggregated abstracts to represent the research tracks of the UHNAI team (Section 3.4).

We chose this clustering method over others because it has been shown to perform better than other unsupervised clustering methods, such as k-means (Slonim et al., 2002). Furthermore, the approach should allow us to identify instances of interdisciplinary research by examining the cluster membership of our abstract data without prior knowledge of the data’s properties. It is necessary to use an unsupervised clustering method because a canonical set of astrobiology documents with which to train a clustering technique does not exist.

We modify the SCs using the following method:
• Journals with a single WoK SC that appears 10 or more times in our dataset use the assigned WoK SC name.
• Journals with a single WoK SC that appears fewer than 10 times are changed to a broader WoK category (e.g. “Biochemical Research Methods” becomes “Biochemistry & Molecular Biology”).
• Journals with two or more SCs of roughly equivalent weight are assigned a new conflated SC (e.g. “Astrophysics & Geophysics”).
• Journals with two or more SCs that have a clear primary SC have “-Multidisciplinary” appended to the primary name.

The dataset has 10216 abstracts integrated over 13 conflated SCs.

We use the Synthetic Minority Over-sampling Technique (SMOTE) (Chawla et al., 2002) to produce synthetic feature vectors, where a feature vector (or feature) is a normalized numerical representation of the words that describe each abstract/instance.

We use SMOTE to create synthetic feature vectors for the minority SCs such that each SC is represented by the same number of features.
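The idea behind SMOTE can be sketched in a few lines: a synthetic vector is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The version below is a simplified illustration in plain Python, not the implementation used in the paper; the function name, the random choice of base samples, and the toy vectors are assumptions.

```python
import math
import random

def smote(samples, n_synthetic, k=5, seed=0):
    """Create n_synthetic vectors by interpolating between minority-class neighbours."""
    if len(samples) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(samples)
        # k nearest neighbours of the base sample by Euclidean distance, excluding itself.
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Toy feature vectors for an under-represented SC (illustrative values only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_synthetic=5, k=2)
print(len(new_points))
```

To balance the SCs as described, one would call such a routine once per minority SC, requesting as many synthetic vectors as that SC falls short of the largest one.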

The dataset contains 731 publications by the UHNAI team.

Each publication is represented by its own abstract and the abstract of each cited publication. We aggregate all of these abstracts in a single feature vector to represent each UHNAI publication. Non-journal publications such as book chapters, conference proceedings and dissertations were included in the dataset, although they constitute a very small fraction of the total publications.
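A minimal sketch of the aggregation step, assuming a plain bag-of-words representation normalized to relative frequencies. The tokenizer, the normalization, and the sample abstracts are hypothetical; the paper's actual preprocessing is not specified in these notes.

```python
import re
from collections import Counter

def aggregated_feature_vector(own_abstract, cited_abstracts):
    """Normalized word frequencies over a paper's abstract plus its cited abstracts."""
    text = " ".join([own_abstract, *cited_abstracts]).lower()
    counts = Counter(re.findall(r"[a-z]+", text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}

# Invented abstracts, used only to show the shape of the output.
vec = aggregated_feature_vector("Ice on Mars", ["Mars ice chemistry"])
print(vec)
```

Because the cited abstracts usually far outnumber the paper's own, the resulting vector mostly reflects what a publication draws on, which is the sense in which it represents a research track.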

For the purposes of this paper, where our goal is to identify actual and potential instances of interdisciplinary research in astrobiology, a meaningful cluster relationship is one where papers from two or more SCs cluster together, or when researchers from different fields have the aggregated abstracts of their papers cluster together.

We begin by estimating the extent to which conflated Web of Knowledge Subject Categories accurately describe the content of astrobiology publications.

However, when abstracts are assigned one of five clusters (Figure 4-top panel), we observe that the cluster membership for most SCs is heterogeneous: there is no clear correspondence between a cluster and a single dominant SC. Even the most common SC, Astronomy & Astrophysics, is primarily distributed across the first three clusters, but is represented in all five.

Table 5 does suggest some areas in which SCs may be more appropriate document labels. For example, Oceanography appears in only one cluster, and the Multidisciplinary Sciences SC is fairly evenly distributed across four of the five. However, when increasing the number of clusters to 10, 15, and 20 (Figure 4), the heterogeneity of SCs within an individual cluster becomes even more pronounced.

One would intuitively expect more SC heterogeneity within each cluster; however, increasing the number of clusters also allows more potential of each SC to dominate a single cluster. When we increase the number of clusters to 10 (Figure 6, Table 7), we find that most of the SCs disperse into multiple clusters.

One way to interpret this result is that more clusters allow finer distinctions between content to be revealed.

At the 10 cluster level, more clusters contain single dominant SCs than at the 5 cluster level.

Figure 7 (Table 8) and Figure 8 (Table 9) present the results of clustering the abstracts into 15 and 20 clusters, respectively. We observe that many of the SCs are found distributed in multiple clusters.

Therefore, at these clustering levels, we operationalize a dominant SC within a cluster as one that either constitutes 50% or more of the abstracts alone, or one that is within 50% of the size of the most common SC.

By this approximation, the results at the 10 cluster level hold: as a group, the Biochemistry and Biotechnology-related SCs dominate the fewest clusters; the Astronomy, Oceanography and Physics group slightly more, and the Geochemistry and Geophysics SCs are again the most diverse, short of the Multidisciplinary Sciences SC.

Overall, at the 10 cluster level, more clusters contain single dominant SCs than at the 5, 15 or 20 cluster levels, and the usefulness of SCs as document labels reaches a relative maximum.

In some cases, the trial processes reveal some inconsistencies in the cluster membership of SCs.

Certain related SCs tend to consistently cluster together, which suggests that SCs are sufficient for characterizing astrobiology publications. However, other SCs have a limited effectiveness as document labels in this interdisciplinary domain, as some SCs did not map well to successively smaller cluster sizes.

Therefore, our results suggest that WoK SCs may not consistently reflect the diverse content of astrobiology publications.

Across all three trials at the 10 cluster level in Figure 6, a single clearly dominant SC could be identified in 27 of the 30 clusters. The Astronomy, Oceanography and Physics SCs demonstrated somewhat less monodisciplinary dominance at the 10 cluster level; all had roughly 20% of their abstracts assigned to other clusters. The Geochemistry & Geophysics and Environmental Sciences SCs demonstrated the most diversity apart from the pure Multidisciplinary Sciences SC, though somewhat surprisingly, the Geochemistry & Geophysics-Multidisciplinary SC appeared in fewer clusters than its core SC.

Analyzing the heterogeneous cluster membership of publications from diverse SCs is one way to assess interdisciplinary research possibilities, but the probabilistic nature of this method should be emphasized. A heterogeneous cluster could indicate that SCs are poor document labels, or that the clustering level should be adjusted to better match the data and metadata, or that a potential interdisciplinary relationship exists. In any case, this process could inform targeted, iterative investigation.

This result suggests that the sIB technique is able to cluster similar research on a high-level; however, utilizing more clusters should provide a lower-level view of overlap in research interests between the authors.

When running the sIB technique for 10 clusters, we begin to see where researchers may find potential collaboration opportunities, and we observe which authors have specialized or broad research interests. Research can be specialized but still integrate methods, techniques and data from multiple disciplines. We believe that an author who is represented primarily in a single cluster may not be engaging frequently in interdisciplinary research, or may be focusing on narrow research problems, or using similar research methods or equipment. In Figure 10, we see that the two astrochemists (Bennett and Kaiser) are entirely represented by cluster 8, consistent with the results presented in Figure 9. We know that their research is heavily influenced by their experimental apparati, thus suggesting that the experimental methods and apparati significantly affect the description of a research track. Interestingly, Schörghofer’s research is on various planetary bodies such as Mars and the Moon, which is also true of Taylor. Therefore, clustering the text of the aggregated abstracts sufficiently illuminates similarities in research tracks across disciplinary boundaries, in this case, between astronomy and geology.

In Figure 11, we observe that Huss, Jewitt, Krot and Meech’s research is found in many clusters. This signifies that their research is likely to be very interdisciplinary. With regards to those authors represented by a few clusters, we cannot conclude that their research is absolutely mono-disciplinary, as it may be very specialized, or utilize the same methods or apparati. However, we believe that those UHNAI authors with publications in multiple clusters are more likely to be engaged in interdisciplinary research. In Figure 12, we observe that of the senior (non-postdoctoral fellows) astronomers (Reipurth, Meech, Jewitt, Haghighipour, Owen, Schörghofer) half (Meech, Jewitt, and Owen) are fairly diverse in their research interests and the other half (Reipurth, Haghighipour, Schörghofer) are engaged in specialized or mono-disciplinary research.
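The reading of cluster membership described above can be summarized per author with two simple numbers: how many clusters an author's papers fall into, and what share sits in the largest one. The helper and the cluster assignments below are hypothetical; they illustrate the reasoning and do not reproduce the paper's figures.

```python
from collections import Counter

def cluster_spread(paper_clusters):
    """Return (number of distinct clusters, share of papers in the largest cluster)."""
    counts = Counter(paper_clusters)
    total = sum(counts.values())
    return len(counts), max(counts.values()) / total

# Invented cluster labels for the papers of two authors.
specialist = [8, 8, 8]
generalist = [1, 2, 2, 5]
print(cluster_spread(specialist))
print(cluster_spread(generalist))
```

As the excerpt cautions, a single-cluster author is not necessarily mono-disciplinary, so such numbers are best treated as a prompt for closer inspection and not as a verdict.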

These results suggest that the sIB method, in combination with aggregated abstracts, can illuminate areas of implicit commonality where the research areas of scientists from diverse disciplines overlap. Furthermore, while clusters do not inherently relate any information about a researcher’s discipline, it is clear that researchers from the same department often cluster together. Therefore, we expect that performing a similar analysis on the entire NASA Astrobiology Institute will show where collaborations between researchers can occur, and can assist NASA with outlining research priorities. These results can serve as the framework for a geospatial visualization of common yet unconnected research tracks and potential collaborators, similar to the “hot regions” described by Bornmann and Waltman (2011).