Deep Research for the Physical Sciences: A Multi-Agent Framework and a Comprehensive Benchmark

Abstract

Deep research agents are large language model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold considerable potential for accelerating research in the physical sciences. However, comprehensive, in-depth evaluations of their capabilities in this domain are still lacking. To address this gap, we introduce PhySciBench, a benchmark closely aligned with physical science research. It comprises 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluating state-of-the-art models and agent systems on PhySciBench reveals limited performance: even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference cost to approximately one-third of that of the strongest baseline. These results establish PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.