NSF-SciFy: Mining the NSF Awards Database for Scientific Claims

Abstract

We introduce NSF-SciFy, a comprehensive dataset of scientific claims and investigation proposals extracted from National Science Foundation award abstracts. While previous scientific claim verification datasets have been limited in size and scope, NSF-SciFy represents a significant advance with 2.8 million claims from 400,000 abstracts spanning all science and mathematics disciplines. We present two focused subsets: NSF-SciFy-MatSci with 114,000 claims from materials science awards, and NSF-SciFy-20K with 135,000 claims across five NSF directorates. Using zero-shot prompting, we develop a scalable approach for joint extraction of scientific claims and investigation proposals. We demonstrate the dataset's utility through three downstream tasks: non-technical abstract generation, claim extraction, and investigation proposal extraction. Fine-tuning language models on our dataset yields substantial improvements, with relative gains often exceeding 100%, particularly for claim and proposal extraction tasks. Our error analysis reveals that extracted claims exhibit high precision but lower recall, suggesting opportunities for further methodological refinement. NSF-SciFy enables new research directions in large-scale claim verification, scientific discovery tracking, and meta-scientific analysis. Code and data are available at https://github.com/darpa-scify/NSFSciFy.