Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
September 28, 2025
Authors: Shreyas Singh, Kunal Singh, Pradeep Moturi
cs.AI
Abstract
Tool-integrated reasoning has emerged as a key focus for enabling agentic
applications. Among these, DeepResearch Agents have gained significant
attention for their strong performance on complex, open-ended
information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system
composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch
model trained from Qwen3-4B and optimized for evidence-based investigation
through live web search and targeted webpage querying. Its training combines
three advances: (i) DUETQA, a 5K-sample dataset generated via multi-agent
self-play that enforces strict web-search dependence and heterogeneous source
grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes
multi-turn Reinforcement Learning with Verifiable Rewards through curriculum
pruning, reward-aware advantage scaling, and per-prompt replay buffers; and
(iii) a steerable step-level reward that classifies each tool call by cognitive
behavior and marginal utility, enabling explicit control over search trajectory
breadth, depth, and horizon. These improvements enable reliable extension of
tool-calling beyond 20 calls when warranted. The second is
Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn
DeepSearch traces into structured, citation-dense DeepResearch Reports for
comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES,
WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves
state-of-the-art performance in the open-weights category while demonstrating
strong generalization to diverse reasoning tasks including HLE, AIME-25,
GPQA-Diamond, and MedQA.
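
For readers unfamiliar with GRPO, the baseline that RAPO extends, the following is a minimal sketch of its group-relative advantage computation. It is not the authors' implementation, and the abstract does not specify how RAPO's curriculum pruning, reward-aware advantage scaling, or per-prompt replay buffers are defined, so none of them are modeled here. The function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own prompt group.

    Assumption: plain GRPO-style normalization, (r - mean) / (std + eps),
    using the population standard deviation of the group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes: successes get positive advantages,
# failures get negative ones.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout receives the same reward: all advantages
# are zero, so this prompt contributes no learning signal.
uniform = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

The second case shows one way a verifiable-reward setup can lose signal: when all rollouts for a prompt score the same, the normalized advantages vanish. Whether and how RAPO addresses this specific case is not stated in the abstract.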