SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Abstract

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether large language models (LLMs) can judge the methodological viability of a research idea before time and computational resources are spent on it. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.