PRL-Bench: A Comprehensive Benchmark for Evaluating the Capabilities of Large Language Models in Frontier Physics Research

Abstract

The paradigm of agentic science requires AI systems to reason robustly and to engage in long-horizon, autonomous exploration. However, current scientific benchmarks remain confined to domain knowledge comprehension and complex reasoning, and fail to evaluate the exploratory nature and procedural complexity of real-world research. In this work, we present research-oriented evaluations in theoretical and computational physics, a natural testbed that offers comprehensive domain knowledge, complex reasoning, and verifiable end-to-end workflows without reliance on physical experiments. We introduce PRL-Bench (Physics Research by LLMs), a benchmark designed to systematically map the capability boundaries of LLMs in executing end-to-end physics research. Constructed from 100 curated papers published in Physical Review Letters since August 2025 and validated by domain experts, PRL-Bench covers five major theory- and computation-intensive subfields of modern physics: astrophysics, condensed matter physics, high-energy physics, quantum information, and statistical physics. Each task in the benchmark is designed to replicate the core properties of authentic scientific research, including exploration-oriented formulation, long-horizon workflows, and objective verifiability, thereby reconstructing the essential reasoning processes and research workflows of real physics research. Evaluation across frontier models shows that performance remains limited, with the best overall score below 50, revealing a pronounced gap between current LLM capabilities and the demands of real scientific research. PRL-Bench serves as a reliable testbed for assessing next-generation AI scientists and for advancing AI systems toward autonomous scientific discovery.