
ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

October 28, 2025
Authors: Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, Diyi Yang, Risa Wechsler, Ioana Ciuca
cs.AI

Abstract

Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.