SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

May 28, 2026
Authors: Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang
cs.AI

Abstract

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
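The optimism bias described in the abstract can be read as an imbalance between two error rates: the share of low-soundness proposals a model calls sound (false positives) and the share of sound proposals it rejects (false negatives). The following is a minimal sketch of that bookkeeping, not the authors' evaluation code. The `Judgment` record, the 1 to 4 score scale, the `threshold` of 3.0 for "sound", and the toy data are all assumptions made for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One proposal: a reviewer-derived score and the model's verdict (hypothetical schema)."""
    reviewer_soundness: float  # assumed mean reviewer soundness sub-score on a 1-4 scale
    model_says_sound: bool     # the LLM's binary verdict on the proposal


def error_rates(judgments, threshold=3.0):
    """Return (false_positive_rate, false_negative_rate).

    A proposal counts as truly sound when its reviewer score is >= threshold.
    The threshold is an illustrative assumption, not a value from the paper.
    """
    fp = fn = neg = pos = 0
    for j in judgments:
        truly_sound = j.reviewer_soundness >= threshold
        if truly_sound:
            pos += 1
            if not j.model_says_sound:
                fn += 1  # sound proposal rejected by the model
        else:
            neg += 1
            if j.model_says_sound:
                fp += 1  # low-soundness proposal accepted by the model
    fpr = fp / neg if neg else 0.0
    fnr = fn / pos if pos else 0.0
    return fpr, fnr


# Toy data only, chosen to mimic an optimistic judge; not results from the benchmark.
toy = [
    Judgment(1.5, True), Judgment(2.0, True), Judgment(2.5, False), Judgment(2.0, True),
    Judgment(3.0, True), Judgment(3.5, True), Judgment(3.0, False), Judgment(4.0, True),
]
fpr, fnr = error_rates(toy)
print(f"false positive rate = {fpr:.2f}, false negative rate = {fnr:.2f}")
```

Under this framing, an optimistic judge shows a high false positive rate with a low false negative rate, and a prompt that merely makes the judge stricter would be expected to move errors from the first rate to the second rather than reduce both.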