PRBench: End-to-end Paper Reproduction in Physics Research

Abstract

AI agents powered by large language models exhibit strong reasoning and problem-solving capabilities, enabling them to assist with scientific research tasks such as formula derivation and code generation. However, whether these agents can reliably perform end-to-end reproduction from real scientific papers remains an open question. We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics. Each task requires an agent to comprehend the methodology of a published paper, implement the corresponding algorithms from scratch, and produce quantitative results matching the original publication. Agents are provided only with the task instruction and the paper content, and operate in a sandboxed execution environment. All tasks are contributed by domain experts from over 20 research groups at the School of Physics, Peking University. Each task is grounded in a real published paper and validated through end-to-end reproduction, with verified ground-truth results and detailed scoring rubrics. Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution. The best-performing agent, OpenAI Codex powered by GPT-5.3-Codex, achieves a mean overall score of 34%. All agents exhibit a zero end-to-end callback success rate, with particularly poor performance in data accuracy and code correctness. We further identify systematic failure modes, including errors in formula implementation, inability to debug numerical simulations, and fabrication of output data. Overall, PRBench provides a rigorous benchmark for evaluating progress toward autonomous scientific research.
