DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Based Reasoning

Abstract

Deep Research (DR) has emerged as a new agentic paradigm for tackling complex, open-ended research tasks. It demands systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, so that every intermediate decision and tool invocation is explicitly traceable. Building on this infrastructure, DuMate-DeepResearch introduces three mechanisms: (i) a graph-based dynamic planning strategy that expands the research roadmap from coarse to fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design that delegates each complex search sub-task to an inner Search Agent, which runs its own planning loop, thereby isolating noisy retrieval and stabilizing long-horizon execution; and (iii) a rubric-based test-time optimization mechanism that dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. On two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while also ranking first in information recall and analysis.
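
To make the interplay of the three mechanisms more concrete, the following is a minimal Python sketch, not the authors' implementation. Every name in it (`PlanNode`, `search_agent`, `run_research`, `rubric_score`, the toy retriever and rubric) is a hypothetical stand-in. It assumes a breadth-first, coarse-to-fine walk over the plan, omits reflection, re-planning, backtracking, and parallel branching, and replaces LLM-generated rubrics with hand-written predicates.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical type aliases, introduced only for this illustration.
Retriever = Callable[[str], List[str]]
Rubric = Dict[str, Callable[[List[str]], bool]]
Trace = List[Tuple[str, int, float]]


@dataclass
class PlanNode:
    """A node in the research roadmap; children refine the parent question."""
    question: str
    children: List["PlanNode"] = field(default_factory=list)


def rubric_score(evidence: List[str], rubric: Rubric) -> float:
    """Fraction of rubric criteria that the collected evidence satisfies."""
    if not rubric:
        return 1.0
    passed = sum(1 for check in rubric.values() if check(evidence))
    return passed / len(rubric)


def search_agent(question: str, retrieve: Retriever, max_queries: int = 2) -> List[str]:
    """Inner loop: issue a few queries and return only filtered hits."""
    kept: List[str] = []
    query = question
    for _ in range(max_queries):
        for hit in retrieve(query):
            # Toy noise filter: drop blank hits and duplicates.
            if hit.strip() and hit not in kept:
                kept.append(hit)
        if kept:
            break  # enough evidence for this sub-task
        query = question + " overview"  # naive query reformulation
    return kept


def run_research(
    root: PlanNode, retrieve: Retriever, rubric: Rubric, threshold: float = 1.0
) -> Tuple[List[str], Trace]:
    """Outer loop: walk the plan coarse-to-fine, delegate, log, stop early."""
    evidence: List[str] = []
    trace: Trace = []  # audit log: (question, hits kept, rubric score)
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        found = search_agent(node.question, retrieve)  # raw hits stay inside
        evidence.extend(item for item in found if item not in evidence)
        score = rubric_score(evidence, rubric)
        trace.append((node.question, len(found), score))
        if score >= threshold:
            break  # adaptive stopping: the rubric is satisfied
        frontier.extend(node.children)  # expand finer-grained sub-questions
    return evidence, trace


# Toy usage with placeholder data (no real corpus or search backend).
corpus = {
    "topic": ["note A: sales"],
    "topic: batteries": ["", "note B: batteries"],
    "topic: policy": ["note C: policy"],
}
rubric: Rubric = {
    "covers_sales": lambda ev: any("sales" in e for e in ev),
    "covers_batteries": lambda ev: any("batteries" in e for e in ev),
}
root = PlanNode("topic", [PlanNode("topic: batteries"), PlanNode("topic: policy")])
evidence, trace = run_research(root, lambda q: corpus.get(q, []), rubric)
print(trace)
```

In this toy run the rubric is fully satisfied after the second node, so the `topic: policy` branch is never searched, and the returned `trace` records each delegated sub-task together with the rubric score at that point. A real system would presumably log far richer records (tool arguments, source URLs, intermediate reasoning); the tuple here only illustrates the idea of a traceable decision log.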