ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition
March 27, 2025
作者: Yujie Liu, Zonglin Yang, Tong Xie, Jinjie Ni, Ben Gao, Yuqiang Li, Shixiang Tang, Wanli Ouyang, Erik Cambria, Dongzhan Zhou
cs.AI
Abstract
Large language models (LLMs) have demonstrated potential in assisting
scientific research, yet their ability to discover high-quality research
hypotheses remains unexamined due to the lack of a dedicated benchmark. To
address this gap, we introduce the first large-scale benchmark for evaluating
LLMs with a near-sufficient set of sub-tasks of scientific discovery:
inspiration retrieval, hypothesis composition, and hypothesis ranking. We
develop an automated framework that extracts critical components - research
questions, background surveys, inspirations, and hypotheses - from scientific
papers across 12 disciplines, with expert validation confirming its accuracy.
To prevent data contamination, we focus exclusively on papers published in
2024, ensuring minimal overlap with LLM pretraining data. Our evaluation
reveals that LLMs perform well in retrieving inspirations, an
out-of-distribution task, suggesting their ability to surface novel knowledge
associations. This positions LLMs as "research hypothesis mines", capable of
facilitating automated scientific discovery by generating innovative hypotheses
at scale with minimal human intervention.
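To make the benchmark's task decomposition concrete, below is a minimal Python sketch of how one extracted paper instance and the three sub-tasks (inspiration retrieval, hypothesis composition, hypothesis ranking) could be represented. The field names, function names, and prompts are illustrative assumptions for exposition, not the authors' actual framework or schema.

```python
from dataclasses import dataclass

# Hypothetical representation of one benchmark instance; field names are
# assumptions based on the components described in the abstract.
@dataclass
class PaperDecomposition:
    research_question: str   # the question the paper sets out to answer
    background_survey: str   # summary of prior work the authors build on
    inspirations: list[str]  # knowledge pieces bridging background and hypothesis
    hypothesis: str          # the paper's final research hypothesis
    discipline: str          # one of the 12 covered disciplines


def inspiration_retrieval(llm, instance, candidate_pool):
    """Sub-task 1: select likely inspirations from a pool of candidate papers."""
    prompt = (
        f"Research question: {instance.research_question}\n"
        f"Background: {instance.background_survey}\n"
        "Which of the following candidates could inspire a new hypothesis?\n"
        + "\n".join(f"- {c}" for c in candidate_pool)
    )
    return llm(prompt)  # expected to return a ranked subset of the pool


def hypothesis_composition(llm, instance, retrieved_inspirations):
    """Sub-task 2: compose a hypothesis from the question and retrieved inspirations."""
    prompt = (
        f"Research question: {instance.research_question}\n"
        f"Inspirations: {retrieved_inspirations}\n"
        "Propose a concrete research hypothesis."
    )
    return llm(prompt)


def hypothesis_ranking(llm, instance, generated_hypotheses):
    """Sub-task 3: rank candidate hypotheses; the ground-truth one should rank highly."""
    prompt = (
        f"Research question: {instance.research_question}\n"
        "Rank the following hypotheses from most to least promising:\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(generated_hypotheses))
    )
    return llm(prompt)
```

In this sketch, `llm` is any callable that maps a prompt string to a model response, so the same three sub-task functions can be reused to compare different models against the extracted ground-truth components.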