DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Based Reasoning

Abstract

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.
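To make the interplay of the three mechanisms concrete, the following is a minimal Python sketch under stated assumptions, not the authors' implementation. It shows an outer loop that delegates retrieval to a recursive inner search routine, checks the collected evidence against a rubric, adds a plan branch for each unmet criterion, and stops once every criterion is satisfied. All names here (`PlanNode`, `search_agent`, `unmet_criteria`, `run_research`, the toy corpus, and the keyword-based rubric checks) are hypothetical and chosen for illustration only; the abstract does not specify how the real system represents plans or rubrics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical type aliases: a retriever maps a question to evidence strings,
# and a rubric maps a criterion name to a check over one evidence string.
Retriever = Callable[[str], List[str]]
Rubric = Dict[str, Callable[[str], bool]]


@dataclass
class PlanNode:
    """One node of a research roadmap (illustrative structure only)."""
    question: str
    children: List["PlanNode"] = field(default_factory=list)


def search_agent(node: PlanNode, retrieve: Retriever,
                 depth: int = 0, max_depth: int = 2) -> List[str]:
    """Inner level: gather evidence for a node, recursing into sub-questions."""
    evidence = list(retrieve(node.question))
    if depth < max_depth:
        for child in node.children:
            evidence.extend(search_agent(child, retrieve, depth + 1, max_depth))
    return evidence


def unmet_criteria(evidence: List[str], rubric: Rubric) -> List[str]:
    """Return rubric criteria that no evidence item satisfies."""
    return [name for name, check in rubric.items()
            if not any(check(item) for item in evidence)]


def run_research(root: PlanNode, retrieve: Retriever, rubric: Rubric,
                 max_rounds: int = 3) -> dict:
    """Outer level: search, score against the rubric, re-plan, stop adaptively."""
    evidence: List[str] = []
    unmet = list(rubric)
    rounds_used = 0
    for rounds_used in range(1, max_rounds + 1):
        evidence = search_agent(root, retrieve)
        unmet = unmet_criteria(evidence, rubric)
        if not unmet:
            break  # adaptive stopping: every criterion is covered
        # Re-plan: add one new branch per unmet criterion (assumed strategy).
        known = {child.question for child in root.children}
        root.children.extend(PlanNode(name) for name in unmet if name not in known)
    score = 1.0 if not rubric else 1 - len(unmet) / len(rubric)
    return {"rounds": rounds_used, "score": score,
            "unmet": unmet, "evidence": evidence}


# Toy usage with an in-memory corpus standing in for real retrieval tools.
corpus = {
    "battery storage": ["cost is falling"],
    "risk": ["risk of thermal runaway"],
}
rubric = {
    "cost": lambda text: "cost" in text,
    "risk": lambda text: "risk" in text,
}
result = run_research(PlanNode("battery storage"),
                      lambda q: corpus.get(q, []), rubric)
print(result["rounds"], result["score"], result["unmet"])
```

In this toy run the first round leaves the `risk` criterion unmet, so the loop adds a branch for it and the second round satisfies the full rubric. A real system would generate the rubric per task and use model-based judgments rather than keyword checks.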