PRBench: End-to-end Paper Reproduction in Physics Research
March 29, 2026
Authors: Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li, Zeyu Li, Zhaolong Zhang, Huiwen Zheng, Leidong Bao, Anqi Lv, Zihan Mo, Yadi Niu, Yiyang Peng, Yu Tian, Yili Wang, Ziyu Wang, Zi-Yu Wang, Jiashen Wei, Liuheng Wu, Aoran Xue, Leyi Yang, Guanglu Yuan, Xiarui Zhan, Jingjun Zhang, Zifan Zheng, Pengfei Liu, Linrui Zhen, Kaiyang Li, Qichang Li, Ziheng Zhou, Guo-En Nian, Yunwei Xiao, Qing-Hong Cao, Linjie Dai, Xu Feng, Peng Gao, Ying Gu, Chang Liu, Jia Liu, Ming-xing Luo, Yan-Qing Ma, Liang-You Peng, Huichao Song, Shufeng Wang, Chenxu Wang, Tao Wang, Yi-Nan Wang, Chengyin Wu, Pengwei Zhao, Hua Xing Zhu
cs.AI
Abstract
AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction of real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results that match the original publication. Agents are given only the task instruction and the paper content, and they operate in a sandboxed execution environment. All tasks were contributed by domain experts from more than 20 research groups at the School of Physics, Peking University; each is grounded in a real published paper and validated through end-to-end reproduction, with verified ground-truth results and detailed scoring rubrics. Using an agentic assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents show a zero end-to-end reproduction success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, an inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
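To make the protocol concrete, the minimal Python sketch below shows one way the setup described in the abstract could be structured: each task bundles an instruction, the paper content, and an expert rubric, and an agent is scored on the fraction of rubric points it earns. This is an illustrative assumption only; the names PRBenchTask, RubricItem, agent.run, and outputs.satisfies are hypothetical and are not part of the released benchmark.

    # A minimal sketch of the described evaluation setup; all names here
    # (PRBenchTask, RubricItem, agent.run, outputs.satisfies) are hypothetical.
    from dataclasses import dataclass, field

    @dataclass
    class RubricItem:
        # One scored criterion, e.g. "reproduced value of observable X within 5%".
        description: str
        max_points: float

    @dataclass
    class PRBenchTask:
        # A single reproduction task: per the abstract, the agent sees only
        # the task instruction and the paper content.
        subfield: str                 # one of the 11 physics subfields
        instruction: str              # natural-language task instruction
        paper_text: str               # content of the published paper
        rubric: list[RubricItem] = field(default_factory=list)

    def evaluate(task: PRBenchTask, agent) -> float:
        # Run the agent (sandboxing elided here) and score its outputs
        # against the expert rubric.
        outputs = agent.run(task.instruction, task.paper_text)  # hypothetical agent API
        earned = sum(item.max_points for item in task.rubric if outputs.satisfies(item))
        total = sum(item.max_points for item in task.rubric)
        return earned / total if total else 0.0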