SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Abstract

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether large language models (LLMs) can judge the methodological viability of a research idea before time and computational resources are spent on it. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against the source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators of scientific rigor.