SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

May 28, 2026
Authors: Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang
cs.AI

Abstract

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
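
To make the false-positive versus false-negative trade-off concrete, the sketch below computes both error rates for binary soundness judgments. It is an illustration only, not the benchmark's evaluation code: the function name `error_rates`, the binarization of soundness into sound/unsound, and the toy data are all assumptions introduced here.

```python
from typing import Sequence


def error_rates(labels: Sequence[bool], predictions: Sequence[bool]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    labels[i] is True when the reference label marks proposal i as sound;
    predictions[i] is True when the model rated proposal i as sound.
    (Hypothetical helper; the binary labeling is an assumption.)
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    # False positive: an unsound proposal that the model rated as sound.
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    # False negative: a sound proposal that the model rated as unsound.
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)

    negatives = sum(1 for y in labels if not y)
    positives = len(labels) - negatives

    # Guard against division by zero when one class is absent.
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


# Toy data invented for illustration; these are not benchmark results.
labels = [True, True, False, False, False, True]
lenient = [True, True, True, True, False, True]      # tends to call everything sound
strict = [False, True, False, False, False, False]   # tends to call everything unsound

lenient_fpr, lenient_fnr = error_rates(labels, lenient)
strict_fpr, strict_fnr = error_rates(labels, strict)
```

In this toy setup the lenient rater makes only false-positive errors and the strict rater makes only false-negative errors, which mirrors the kind of shift the abstract describes: a harsher prompt can move the errors to the other side without necessarily reducing them.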