PRBench: End-to-end Paper Reproduction in Physics Research
March 29, 2026
Authors: Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Peng Gao, Ying Gu, Chang Liu, Jia Liu, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, Pengwei Zhao, Hua Xing Zhu
cs.AI
Abstract
AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction of real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are given only the task instruction and the paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from more than 20 research groups at the School of Physics, Peking University; each is grounded in a real published paper and validated through end-to-end reproduction, with verified ground-truth results and detailed scoring rubrics. Using an agentic assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end reproduction success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, an inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.