SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
September 8, 2025
Authors: Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty
cs.AI
Abstract
Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented ("thinking") models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. This paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take on pre-defined roles and are told what to do at each step of a static workflow, an autonomous single agent determines its next action dynamically based on context, without manual direction. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. To this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant, SFR-DR-20B, achieves up to 28.7% on the Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insight into our methodology.
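
To make the autonomous single-agent design concrete, below is a minimal sketch, assuming a generic JSON tool-calling protocol, of the kind of loop such an agent runs: the model, rather than a fixed workflow, chooses the next action at every turn from the accumulated context. This is illustrative only, not the paper's implementation; the tool names (search_web, crawl_page, run_python) and the llm stub are hypothetical placeholders.

```python
# Minimal sketch of an autonomous single-agent tool-use loop (illustrative
# only; not the paper's implementation). The model decides the next action
# from the accumulated context at every step, instead of following a static,
# pre-scripted workflow.
import json

def llm(messages):
    # Hypothetical placeholder: a real implementation would call a
    # reasoning-optimized model and return its JSON-formatted action.
    return json.dumps({"answer": "<final answer>"})

# Stub tools mirroring the minimal toolset described in the abstract:
# web search/crawling plus Python execution (sandboxed in practice).
TOOLS = {
    "search_web": lambda query: f"<search results for {query!r}>",
    "crawl_page": lambda url: f"<text content of {url}>",
    "run_python": lambda code: "<stdout of the executed code>",
}

def run_agent(question, max_steps=20):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        # The model emits either a tool call, e.g.
        # {"tool": "search_web", "args": {"query": "..."}},
        # or a final answer, e.g. {"answer": "..."}.
        action = json.loads(llm(messages))
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "tool", "content": result})
    return None  # step budget exhausted without a final answer
```

Because the loop carries no role assignments or scripted transitions, the same code path covers searching, crawling, computing, and answering; which of these happens next is entirely the model's decision, which is the single-agent property the abstract contrasts with multi-agent pipelines.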