NatureBench: Can Coding Agents Match the State of the Art in Published Nature-Family Papers?

Abstract

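We present NatureBench, a cross-disciplinary benchmark of 90 tasks extracted from peer-reviewed Nature-family journals, designed to evaluate whether AI coding agents can move from reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, containerized environment for each task from its source paper, addressing the environment fragmentation that has undermined the credibility of earlier benchmarks for research agents. After evaluating ten frontier agent configurations under a strict protocol with web search disabled, we find that the strongest model surpasses the state of the art (SOTA) on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways shows that agents rely mainly on methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than on genuine scientific invention. Failures stem mainly from wrong method choice and insufficient compute, not from misunderstanding the task. We release the benchmark, the NatureGym pipeline, and a public leaderboard that supports maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench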

English

We introduce NatureBench, a cross-discipline benchmark of 90 tasks distilled from peer-reviewed Nature-family publications, designed to evaluate whether AI coding agents can move beyond reproduction toward discovery on real scientific problems. NatureBench is built on NatureGym, an automated pipeline that constructs a standardized, per-task containerized environment from a source paper, addressing the environment-fragmentation problem that has limited the credibility of prior agent-on-research benchmarks. Evaluating ten frontier agent configurations under a strict web-search-disabled protocol, we find that the strongest model surpasses SOTA on only 17.8% of tasks under the g>0.1 criterion. Analysis of method pathways reveals that agents succeed primarily through methodological translation, converting scientific tasks into familiar supervised prediction problems, rather than through genuine scientific invention. Failures are dominated by wrong method choice and insufficient compute budget, not by task misunderstanding. We release the benchmark, the NatureGym pipeline, and a public leaderboard with maintainer-side reproduction. Code: https://github.com/FrontisAI/NatureBench
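
To make the headline figure concrete, the following is a minimal sketch of how a surpass rate under a thresholded criterion such as g>0.1 could be computed. The abstract does not define g, so the sketch treats it as an opaque per-task score supplied by the caller. The function name `surpass_rate` and the example scores are hypothetical and are not taken from the NatureBench code. The 16-of-90 split is an assumption chosen because it is consistent with the reported 17.8%; the abstract itself does not state the task count behind that percentage.

```python
from typing import Sequence


def surpass_rate(g_values: Sequence[float], threshold: float = 0.1) -> float:
    """Return the fraction of tasks whose score g is strictly above threshold.

    g is treated as an opaque per-task number; its definition is not
    given in the abstract, so nothing about its meaning is assumed here.
    """
    if not g_values:
        raise ValueError("g_values must contain at least one task score")
    wins = sum(1 for g in g_values if g > threshold)
    return wins / len(g_values)


# Hypothetical scores for 90 tasks: 16 above the threshold, 74 not.
example_scores = [0.2] * 16 + [0.0] * 74
rate = surpass_rate(example_scores)
print(f"{100 * rate:.1f}% of tasks exceed g > 0.1")
```

The comparison is strict, so a task with g exactly equal to 0.1 does not count as surpassing SOTA in this sketch; whether the benchmark treats the boundary the same way is not stated in the abstract.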