Deep Research in Physical Sciences: A Multi-Agent Framework and Comprehensive Benchmark

June 17, 2026
Authors: Yigeng Jiang, Tengchao Yang, Taoyong Cui, Jiaxing Wan, Yuan Wang, Weida Wang, Zhiyu Liu, Chuyi Peng, Binzhao Luo, Maoli Gao, Huaihai Huang, Yuqianer Zeng, Ziyang Zheng, Dongchen Huang, Chao Chen, Zichao Liu, Weiping Shen, Shuchen Pu, Siyu Zhou, Runmin Ma, Yusong Hu, Fei Chao, Bo Zhang, Xiawu Zheng, Zifu Wang, Lei Bai, Yunqi Cai, Shufei Zhang
cs.AI

Abstract

Deep research agents are large language model (LLM)-based systems designed for autonomous, multi-step scientific reasoning, and they hold immense potential for accelerating research in the physical sciences. However, comprehensive and in-depth evaluations of their capabilities in this domain are still lacking. To address this gap, we introduce PhySciBench, a benchmark closely aligned with physical science research. It comprises 200 expert-curated questions, balanced between physics and chemistry, across six task categories that reflect real-world scientific workflows. Evaluations of state-of-the-art models and agent systems on PhySciBench reveal limited performance; even the strongest baseline, Gemini Deep Research, achieves an accuracy of only 33.5%. Analysis of failure cases identifies three recurrent deficiencies: fragility in extended reasoning chains, limited knowledge transfer across steps, and a lack of physics-grounded self-verification. Motivated by these findings, we develop DelveAgent, a modular multi-agent framework equipped with an adaptive planning loop, dual-granularity memory, and a hierarchical physics-grounded reflection mechanism. Across four scientific benchmarks, DelveAgent improves accuracy by up to 7.5 percentage points while reducing inference cost to approximately one-third of that of the strongest baseline. These results establish PhySciBench as a critical benchmark for evaluating AI systems in the physical sciences and demonstrate that architectural specialization can effectively enhance the reliability of autonomous scientific research.