

SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

September 8, 2025
作者: Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty
cs.AI

Abstract

Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented ("thinking") models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.
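The abstract contrasts static multi-agent workflows with an autonomous single agent that decides its own next action from context. A minimal sketch of such a loop is below; the tool names (`search_internet`, `run_python`), the `decide` interface, and the stub behaviors are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an autonomous single-agent loop: at every step the
# model itself chooses whether to call a tool or to stop and answer, rather
# than following a pre-scripted workflow. All names here are assumptions.

def search_internet(query):
    # Stub for a minimal web-crawling tool (assumption).
    return f"results for: {query}"

def run_python(code):
    # Stub for a sandboxed Python execution tool (assumption).
    return "execution output"

TOOLS = {"search": search_internet, "python": run_python}

def decide(context):
    # Stand-in for the LLM policy: it reads the full interaction context and
    # returns either ("answer", text) or (tool_name, tool_input).
    if "results for:" in context:
        return ("answer", "final answer based on gathered evidence")
    return ("search", "deep research question")

def run_agent(question, max_steps=10):
    context = question
    for _ in range(max_steps):
        action, payload = decide(context)
        if action == "answer":
            return payload  # the agent chose to terminate on its own
        observation = TOOLS[action](payload)
        context += f"\n[{action}] {payload} -> {observation}"
    return "no answer within step budget"

print(run_agent("What is Deep Research?"))
```

The key design point mirrored here is that control flow lives inside the policy's decision, not in an external orchestrator: the loop only executes whatever action the model emits next.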