Fathom-DeepResearch: Unlocking Long-Horizon Information Retrieval and Synthesis for SLMs

Abstract

Tool-integrated reasoning has emerged as a key focus for enabling agentic applications. Among these, DeepResearch agents have gained significant attention for their strong performance on complex, open-ended information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model trained from Qwen3-4B and optimized for evidence-based investigation through live web search and targeted webpage querying. Its training combines three advances: (i) DUETQA, a 5K-sample dataset generated via multi-agent self-play that enforces strict web-search dependence and heterogeneous source grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes multi-turn Reinforcement Learning with Verifiable Rewards through curriculum pruning, reward-aware advantage scaling, and per-prompt replay buffers; and (iii) a steerable step-level reward that classifies each tool call by cognitive behavior and marginal utility, enabling explicit control over the breadth, depth, and horizon of the search trajectory. Together, these improvements allow tool use to extend reliably beyond 20 calls when warranted. The second is Fathom-Synthesizer-4B, also trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch reports for comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES, WebWalker, Seal0, MuSiQue) and on DeepResearch-Bench, the system achieves state-of-the-art performance in the open-weights category and generalizes strongly to diverse reasoning tasks, including HLE, AIME-25, GPQA-Diamond, and MedQA.
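
To make the motivation for RAPO's three components concrete, the following is a minimal sketch, not the authors' implementation. It shows the standard GRPO-style group-normalized advantage, which collapses to zero when every rollout for a prompt earns the same reward, and then two illustrative helpers: a per-prompt replay buffer and a pruning check. The names `PromptReplayBuffer`, `patch_group`, and `should_prune`, the binary reward convention, and the 0.9 solve-rate threshold are all assumptions made for illustration; the abstract does not specify these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its prompt group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class PromptReplayBuffer:
    """Hypothetical buffer that remembers the latest successful rollout per prompt."""

    def __init__(self):
        self._latest_success = {}

    def record(self, prompt_id, rollouts, rewards):
        # Keep the most recent rollout with a positive reward (assumed success signal).
        for rollout, reward in zip(rollouts, rewards):
            if reward > 0:
                self._latest_success[prompt_id] = (rollout, reward)

    def patch_group(self, prompt_id, rollouts, rewards):
        # If every rollout failed and a past success exists, swap it into slot 0
        # so the group has non-zero reward variance again.
        if prompt_id in self._latest_success and all(r == 0 for r in rewards):
            rollout, reward = self._latest_success[prompt_id]
            return [rollout] + list(rollouts[1:]), [reward] + list(rewards[1:])
        return list(rollouts), list(rewards)


def should_prune(rewards, solve_rate_threshold=0.9):
    """Assumed curriculum-pruning rule: drop prompts the policy already solves reliably."""
    solve_rate = mean([1.0 if r > 0 else 0.0 for r in rewards])
    return solve_rate >= solve_rate_threshold


# An all-fail group yields zero advantages, hence no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# After patching in a remembered success, the advantages become informative.
buffer = PromptReplayBuffer()
buffer.record("q1", ["good trace", "bad trace"], [1.0, 0.0])
patched_rollouts, patched_rewards = buffer.patch_group("q1", ["a", "b"], [0.0, 0.0])
patched = group_relative_advantages(patched_rewards)
```

In this sketch the buffer only acts on groups with no reward variance, so it adds no extra rollouts, which is one plausible reading of "zero-overhead". How RAPO actually scales advantages by reward is not described in the abstract and is not modeled here.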