PRBench: End-to-end Paper Reproduction in Physics Research

Abstract

AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University. Each task is grounded in a real published paper and validated through end-to-end reproduction with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
