

ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition

March 27, 2025
作者: Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
cs.AI

Abstract

Large language models (LLMs) have demonstrated potential in assisting scientific research, yet their ability to discover high-quality research hypotheses remains unexamined due to the lack of a dedicated benchmark. To address this gap, we introduce the first large-scale benchmark for evaluating LLMs with a near-sufficient set of sub-tasks of scientific discovery: inspiration retrieval, hypothesis composition, and hypothesis ranking. We develop an automated framework that extracts critical components - research questions, background surveys, inspirations, and hypotheses - from scientific papers across 12 disciplines, with expert validation confirming its accuracy. To prevent data contamination, we focus exclusively on papers published in 2024, ensuring minimal overlap with LLM pretraining data. Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as "research hypothesis mines", capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
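
To make the task decomposition concrete, the sketch below shows one way a benchmark item and the retrieval sub-task could be represented in Python. Everything here is an illustrative assumption: the PaperRecord fields, the word-overlap retriever, and the recall_at_k metric are stand-ins chosen for readability, not the paper's actual extraction schema or scoring protocol.

```python
from dataclasses import dataclass

@dataclass
class PaperRecord:
    """One benchmark item: the four components the paper's automated
    framework extracts from each 2024 publication. Field names are
    illustrative assumptions, not the benchmark's published schema."""
    research_question: str
    background_survey: str
    inspirations: list[str]   # source ideas that seeded the hypothesis
    hypothesis: str           # ground-truth hypothesis stated in the paper

def retrieve_inspirations(question: str, corpus: list[str], k: int) -> list[str]:
    """Toy baseline for sub-task 1 (inspiration retrieval): rank candidate
    abstracts by word overlap with the research question and keep the top k.
    A real evaluation would prompt an LLM or use a dense retriever instead."""
    q_words = set(question.lower().split())
    return sorted(corpus,
                  key=lambda doc: -len(q_words & set(doc.lower().split())))[:k]

def recall_at_k(retrieved: list[str], gold: list[str]) -> float:
    """One plausible retrieval metric: the fraction of ground-truth
    inspirations recovered among the top-k candidates."""
    return len(set(retrieved) & set(gold)) / len(gold) if gold else 0.0

# Minimal usage example with a two-document corpus.
record = PaperRecord(
    research_question="How can catalysts improve CO2 reduction selectivity?",
    background_survey="Prior work studies transition-metal catalysts.",
    inspirations=["single-atom catalysts tune binding energies"],
    hypothesis="Single-atom Cu catalysts raise CO2-to-ethanol selectivity.",
)
corpus = record.inspirations + ["an unrelated abstract about protein folding"]
top = retrieve_inspirations(record.research_question, corpus, k=1)
print(recall_at_k(top, record.inspirations))  # -> 1.0
```

The other two sub-tasks would plug into the same record: hypothesis composition combines the research question, background survey, and retrieved inspirations into a candidate hypothesis, and hypothesis ranking scores candidates against record.hypothesis.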
