DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Based Reasoning

Abstract

Deep Research (DR) has emerged as a new agentic paradigm for tackling complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy that expands the research roadmap from coarse to fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design that delegates each complex search sub-task to an inner Search Agent, which runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; and (iii) a rubric-based test-time optimization mechanism that dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. On two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II, where it also ranks first in information recall and analysis.
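
To make mechanisms (ii) and (iii) concrete, the following is a minimal toy sketch in Python, not the DuMate-DeepResearch implementation. Every name in it (`Rubric`, `inner_search_agent`, `outer_agent`, the dictionary-backed `corpus`) is hypothetical, and several simplifications are assumed: rubrics are supplied by hand rather than generated dynamically, a rubric counts as satisfied when a keyword appears in the collected evidence, and "noisy retrieval" is modeled as documents prefixed with `noise`. It illustrates only the control flow: an outer loop that delegates each sub-task to an inner loop, receives filtered evidence, records a trace of every decision, and stops once all rubrics are met.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Rubric:
    """One task-specific quality criterion (hypothetical structure)."""
    name: str
    keyword: str  # toy check: met if the keyword appears in any evidence item

    def satisfied(self, evidence: List[str]) -> bool:
        return any(self.keyword in item for item in evidence)


def inner_search_agent(subtask: str,
                       corpus: Dict[str, List[str]],
                       max_steps: int = 3) -> List[str]:
    """Inner loop: inspects up to max_steps documents for one sub-task and
    returns only the ones that pass a (toy) noise filter."""
    found: List[str] = []
    for doc in corpus.get(subtask, [])[:max_steps]:
        if not doc.startswith("noise"):  # noisy hits never reach the outer agent
            found.append(doc)
    return found


def outer_agent(subtasks: List[str],
                rubrics: List[Rubric],
                corpus: Dict[str, List[str]]
                ) -> Tuple[List[str], List[Tuple[str, str]], List[str]]:
    """Outer loop: delegates sub-tasks, logs each decision, and stops early
    once every rubric is satisfied by the evidence gathered so far."""
    evidence: List[str] = []
    trace: List[Tuple[str, str]] = []  # audit log of (action, subtask)
    for subtask in subtasks:
        if all(r.satisfied(evidence) for r in rubrics):
            trace.append(("stop", subtask))  # adaptive stopping
            break
        evidence.extend(inner_search_agent(subtask, corpus))
        trace.append(("search", subtask))
    unmet = [r.name for r in rubrics if not r.satisfied(evidence)]
    return evidence, trace, unmet
```

With a three-sub-task plan and two rubrics, the outer loop in this sketch searches the first two sub-tasks and skips the third once both rubrics are covered. The returned `trace` is the toy analogue of the auditable decision log the abstract describes.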