SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

Abstract

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether large language models (LLMs) can judge the methodological viability of a research idea before time and computational resources are spent on it. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.
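
As an illustration of the error pattern described above, the following minimal sketch shows one way false-positive and false-negative rates could be computed once reviewer soundness sub-scores are binarized. It is not the benchmark's actual evaluation code: the `Judgment` record, the `error_rates` helper, the threshold of 3.0, and the toy data are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One proposal: reviewer label plus the model's verdict (hypothetical schema)."""
    reviewer_soundness: float  # mean reviewer soundness sub-score (scale assumed)
    model_says_sound: bool     # the LLM's binary verdict on the proposal


def error_rates(judgments, sound_threshold=3.0):
    """Return false-positive and false-negative rates.

    A proposal counts as truly sound when its reviewer score is at or above
    `sound_threshold`; the cutoff of 3.0 is an assumption, not the paper's.
    """
    fp = fn = tp = tn = 0
    for j in judgments:
        truly_sound = j.reviewer_soundness >= sound_threshold
        if j.model_says_sound and not truly_sound:
            fp += 1  # low-soundness proposal rated as sound (optimism)
        elif not j.model_says_sound and truly_sound:
            fn += 1  # sound proposal rejected (over-correction)
        elif j.model_says_sound:
            tp += 1
        else:
            tn += 1
    # Guard against empty classes so the sketch never divides by zero.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return {"false_positive_rate": fpr, "false_negative_rate": fnr}


# Made-up toy data, for illustration only (not results from the benchmark).
toy = [
    Judgment(1.5, True),
    Judgment(2.0, True),
    Judgment(2.5, False),
    Judgment(3.0, True),
    Judgment(3.5, True),
    Judgment(4.0, False),
]
rates = error_rates(toy)
```

Under this framing, an optimism bias would appear as a high false-positive rate, and a prompt that merely trades false positives for false negatives would lower the first rate while raising the second.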