Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

Abstract

Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems. To sustain this capability on long-horizon tasks, reliable verification is critical during both training and inference. A major bottleneck in existing paradigms is the lack of explicit verification mechanisms in QA data synthesis, trajectory construction, and test-time scaling. Errors introduced at each stage propagate downstream and degrade overall agent performance. To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA data synthesis: we introduce verification mechanisms into graph-based and agent-based QA synthesis to control question difficulty while ensuring that answers are unique and correct; (2) trajectory construction: we design a verification-driven trajectory synthesis method that injects explicit verification patterns into training trajectories; and (3) test-time scaling: we use Marco DeepResearch itself as a verifier at inference time, which effectively improves performance on challenging questions. Extensive experimental results demonstrate that Marco DeepResearch significantly outperforms 8B-scale deep research agents on the most challenging benchmarks, such as BrowseComp and BrowseComp-ZH. Crucially, under a maximum budget of 600 tool calls, Marco DeepResearch even surpasses or approaches several 30B-scale agents, such as Tongyi DeepResearch-30B.
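The abstract does not spell out how the agent acts as its own verifier at test time, so the following is only a minimal sketch of one plausible reading: keep attempting the question, ask a verifier to accept or reject each answer, and stop once an answer is accepted or the tool-call budget is spent. The function name, the `attempt` and `verify` callables, and their return shapes are all hypothetical and not taken from the paper; only the 600-call budget comes from the text.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   attempt(question)        -> (answer, tool_calls_used)
#   verify(question, answer) -> (accepted, tool_calls_used)
Attempt = Callable[[str], Tuple[str, int]]
Verify = Callable[[str, str], Tuple[bool, int]]


def answer_with_self_verification(
    question: str,
    attempt: Attempt,
    verify: Verify,
    max_tool_calls: int = 600,
) -> Optional[str]:
    """Retry until the verifier accepts an answer or the budget runs out."""
    used = 0
    last_answer: Optional[str] = None
    while used < max_tool_calls:
        answer, cost = attempt(question)
        used += max(cost, 1)  # count at least one call so the loop terminates
        last_answer = answer
        if used >= max_tool_calls:
            break  # no budget left to run the verifier
        accepted, cost = verify(question, answer)
        used += max(cost, 1)
        if accepted:
            return answer
    # Budget exhausted: fall back to the most recent unverified answer.
    return last_answer
```

In this sketch the budget is checked only between steps, so a single attempt or verification can overshoot it; a stricter version would pass the remaining budget into each call.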
