NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

May 25, 2026
Authors: Delip Rao, Weiqiu You, Eric Wong, Chris Callison-Burch
cs.AI

Abstract

We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. While previous scientific claim verification datasets have been limited in size and scope, NSF-SciFy represents a significant advance with 2.8 million claims from 400,000 abstracts spanning all science and mathematics disciplines. We present two focused subsets: NSF-SciFy-MatSci with 114,000 claims from materials science awards, and NSF-SciFy-20K with 135,000 claims across five NSF directorates. Using zero-shot prompting, we develop a scalable approach for joint extraction of scientific claims and investigation proposals. We demonstrate the dataset's utility through three downstream tasks: non-technical abstract generation, claim extraction, and investigation proposal extraction. Fine-tuning language models on our dataset yields substantial improvements, with relative gains often exceeding 100%, particularly for claim and proposal extraction tasks. Our error analysis reveals that extracted claims exhibit high precision but lower recall, suggesting opportunities for further methodological refinement. NSF-SciFy enables new research directions in large-scale claim verification, scientific discovery tracking, and meta-scientific analysis. Code and data are available at https://github.com/darpa-scify/NSFSciFy.
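
The abstract reports results as relative gains and describes extraction quality in terms of precision and recall. As a rough illustration of how such figures are typically computed, here is a minimal sketch. The function names and the exact-string matching of claims are assumptions made for this example; they are not taken from the NSF-SciFy code, and the paper's actual evaluation may match claims differently.

```python
def relative_gain(baseline: float, finetuned: float) -> float:
    """Relative improvement of a fine-tuned score over a baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (finetuned - baseline) / baseline * 100.0


def precision_recall(extracted: set[str], reference: set[str]) -> tuple[float, float]:
    """Precision and recall of extracted claims against reference claims.

    Assumption: two claims match only if their strings are identical.
    """
    true_positives = len(extracted & reference)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical numbers, chosen only to illustrate the arithmetic.
gain = relative_gain(baseline=0.25, finetuned=0.5)
precision, recall = precision_recall({"a", "b"}, {"a", "b", "c", "d"})
print(gain, precision, recall)
```

In this toy case a score that doubles from 0.25 to 0.5 is a 100% relative gain, and an extractor that returns two correct claims out of four reference claims has perfect precision but only 50% recall, which is the "high precision, lower recall" pattern the abstract describes.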