NatureBench: Can Coding Agents Match the Published SOTA of Nature-Family Papers?

Abstract

We introduce NatureBench, a cross-disciplinary benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, containerized environment for each task from its source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict protocol with web search disabled, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g > 0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, that is, by converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench
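
To make the headline figure concrete, the following is a minimal sketch of how a surpass rate under a threshold criterion could be computed. The abstract does not define g, so it is treated here purely as a per-task numeric gain supplied by the caller. The function name `surpass_rate` and the example values are hypothetical and are not taken from the NatureBench code. With 90 tasks, a rate of 17.8% is consistent with 16 tasks exceeding the threshold (16/90 ≈ 0.178).

```python
from typing import Sequence


def surpass_rate(gains: Sequence[float], threshold: float = 0.1) -> float:
    """Return the fraction of tasks whose gain g strictly exceeds the threshold.

    `gains` holds one value of g per task. How g is defined is an assumption
    left to the caller; this sketch only applies the g > threshold test.
    """
    if not gains:
        raise ValueError("gains must be non-empty")
    return sum(g > threshold for g in gains) / len(gains)


# Hypothetical per-task gains (not real benchmark data):
# 16 tasks above the threshold and 74 at or below it, 90 tasks in total.
example_gains = [0.25] * 16 + [0.0] * 74
rate = surpass_rate(example_gains)
print(f"{rate:.1%}")  # 16 of 90 tasks -> 17.8%
```

The comparison is strict, so a task with g exactly equal to 0.1 does not count as surpassing SOTA in this sketch.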