Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design
March 30, 2026
Authors: Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang, Longyue Wang, Zhao Xu, Weihua Luo
cs.AI
Abstract
Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms is the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling: errors introduced at each stage propagate downstream and degrade overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA data synthesis: we introduce verification mechanisms into graph-based and agent-based QA synthesis to control question difficulty while ensuring that answers are unique and correct; (2) trajectory construction: we design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) test-time scaling: we use Marco DeepResearch itself as a verifier at inference time, which effectively improves performance on challenging questions. Extensive experimental results demonstrate that the proposed Marco DeepResearch agent significantly outperforms 8B-scale deep research agents on most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Notably, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, such as Tongyi DeepResearch-30B.
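The abstract describes test-time scaling only at a high level: the agent acts as its own verifier under a maximum budget of 600 tool calls. One way to picture this is a verify-and-retry loop, sketched below. This is an illustrative guess, not the authors' implementation: `Attempt`, `run_agent`, `verify`, and `solve_with_verification` are hypothetical names, and the sketch assumes, for simplicity, that verification itself consumes no tool calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Attempt:
    """One research attempt (hypothetical structure, not from the paper)."""
    answer: str
    tool_calls: int  # tool calls this attempt reports having consumed


def solve_with_verification(
    question: str,
    run_agent: Callable[[str, int], Attempt],
    verify: Callable[[str, str], bool],
    max_tool_calls: int = 600,
) -> Tuple[Optional[str], int]:
    """Retry until the verifier accepts an answer or the budget runs out.

    run_agent(question, remaining_budget) and verify(question, answer)
    are placeholders for calls to the same underlying agent.
    Returns (answer, tool_calls_used); the answer is the last attempt's
    answer if none was accepted.
    """
    used = 0
    last_answer: Optional[str] = None
    while used < max_tool_calls:
        remaining = max_tool_calls - used
        attempt = run_agent(question, remaining)
        # Charge at least one call (avoids an infinite loop on zero-cost
        # attempts) and never more than the remaining budget.
        used += min(remaining, max(1, attempt.tool_calls))
        last_answer = attempt.answer
        if verify(question, attempt.answer):
            return attempt.answer, used
    return last_answer, used
```

With stub functions in place of a real agent, for example an agent that costs 100 tool calls per attempt and a verifier that accepts only the third answer, the loop returns that third answer after spending 300 calls. How the actual system allocates its budget between solving and verifying is not specified in the abstract.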