Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
September 28, 2025
Authors: Shreyas Singh, Kunal Singh, Pradeep Moturi
cs.AI
Abstract
Tool-integrated reasoning has emerged as a key focus for enabling agentic
applications. Among these, DeepResearch Agents have gained significant
attention for their strong performance on complex, open-ended
information-seeking tasks. We introduce Fathom-DeepResearch, an agentic system
composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch
model trained from Qwen3-4B and optimized for evidence-based investigation
through live web search and targeted webpage querying. Its training combines
three advances: (i) DUETQA, a 5K-sample dataset generated via multi-agent
self-play that enforces strict web-search dependence and heterogeneous source
grounding; (ii) RAPO, a zero-overhead extension of GRPO that stabilizes
multi-turn Reinforcement Learning with Verifiable Rewards through curriculum
pruning, reward-aware advantage scaling, and per-prompt replay buffers; and
(iii) a steerable step-level reward that classifies each tool call by cognitive
behavior and marginal utility, enabling explicit control over search trajectory
breadth, depth, and horizon. These improvements enable reliable extension of
tool-calling beyond 20 calls when warranted. The second is
Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn
DeepSearch traces into structured, citation-dense DeepResearch Reports for
comprehensive synthesis. Evaluated on DeepSearch benchmarks (SimpleQA, FRAMES,
WebWalker, Seal0, MuSiQue) and DeepResearch-Bench, the system achieves
state-of-the-art performance in the open-weights category while demonstrating
strong generalization to diverse reasoning tasks including HLE, AIME-25,
GPQA-Diamond, and MedQA.
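
For readers unfamiliar with GRPO, the baseline that RAPO is described as extending, the following is a minimal sketch of the standard group-relative advantage computation: several rollouts are sampled for the same prompt, and each rollout's reward is standardized against the other rollouts in that group. This is an illustration of plain GRPO only, not the authors' implementation. The function names, the `eps` constant, and the use of the population standard deviation are assumptions made here. The abstract does not specify how RAPO's curriculum pruning, reward-aware advantage scaling, or per-prompt replay buffers are defined, so none of them are modeled.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward within its prompt group (GRPO-style).

    `rewards` holds one scalar reward per rollout of the same prompt.
    `eps` (an assumed constant) avoids division by zero for uniform groups.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


def has_no_learning_signal(rewards):
    """True when every rollout in the group received the same reward."""
    return len(set(rewards)) <= 1


# Mixed group: correct rollouts get positive advantage, incorrect ones negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform group: every advantage is zero, so the group contributes no gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0])
```

With binary verifiable rewards, a group in which every rollout succeeds or every rollout fails yields all-zero advantages, so it contributes nothing to the policy update. That behaviour of the baseline is a useful reference point when reading the abstract's claim that RAPO stabilizes multi-turn training.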