NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Abstract

We introduce NatureBench, a cross-disciplinary benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, containerized environment for each task from its source paper, addressing the environment-fragmentation problem that has limited the credibility of prior benchmarks of agents on research tasks. We evaluate ten frontier agent configurations under a strict protocol with web search disabled and find that the strongest model surpasses SOTA on only 17.8% of tasks under the g > 0.1 criterion. Analysis of method pathways shows that agents succeed primarily through methodological translation, that is, by converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench