SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents
September 8, 2025
Authors: Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty
cs.AI
Abstract
Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented ("thinking") models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. This paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take on pre-defined roles and are told what to do at each step of a static workflow, an autonomous single agent determines its next action dynamically based on context, without manual direction. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. To this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant, SFR-DR-20B, achieves up to 28.7% on the Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insight into our methodology.
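
To make the autonomous single-agent design concrete, below is a minimal sketch, assuming a generic JSON tool-calling protocol, of the kind of loop such an agent runs: the model, rather than a fixed workflow, chooses the next action at every turn from the accumulated context. This is illustrative only, not the paper's implementation; the tool names (search_web, crawl_page, run_python) and the llm stub are hypothetical placeholders.

```python
# Minimal sketch of an autonomous single-agent tool-use loop (illustrative
# only; not the paper's implementation). The model decides the next action
# from the accumulated context at every step, instead of following a static,
# pre-scripted workflow.
import json

def llm(messages):
    # Hypothetical placeholder: a real implementation would call a
    # reasoning-optimized model and return its JSON-formatted action.
    return json.dumps({"answer": "<final answer>"})

# Stub tools mirroring the minimal toolset described in the abstract:
# web search/crawling plus Python execution (sandboxed in practice).
TOOLS = {
    "search_web": lambda query: f"<search results for {query!r}>",
    "crawl_page": lambda url: f"<text content of {url}>",
    "run_python": lambda code: "<stdout of the executed code>",
}

def run_agent(question, max_steps=20):
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        # The model emits either a tool call, e.g.
        # {"tool": "search_web", "args": {"query": "..."}},
        # or a final answer, e.g. {"answer": "..."}.
        action = json.loads(llm(messages))
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "tool", "content": result})
    return None  # step budget exhausted without a final answer
```

Because the loop carries no role assignments or scripted transitions, the same code path covers searching, crawling, computing, and answering; which of these happens next is entirely the model's decision, which is the single-agent property the abstract contrasts with multi-agent pipelines.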