Reinforcement Learning Foundations for Deep Research Systems: A Survey
September 8, 2025
Authors: Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, Yong Liu
cs.AI
Abstract
Deep research systems are agentic AI that solve complex, multi-step tasks by
coordinating reasoning, search across the open web and user files, and tool
use; they are moving toward hierarchical deployments with a Planner,
Coordinator, and Executors. In practice, training entire stacks end-to-end remains
impractical, so most work trains a single planner connected to core tools such
as search, browsing, and code. While supervised fine-tuning (SFT) imparts
protocol fidelity, it suffers from imitation and exposure biases and underuses
environment feedback. Preference-alignment methods such as DPO are schema- and
proxy-dependent, off-policy, and weak for long-horizon credit assignment and
multi-objective trade-offs. A further limitation of SFT and DPO is their
reliance on human-defined decision points and subskills through schema design
and labeled
comparisons. Reinforcement learning aligns with closed-loop, tool-interaction
research by optimizing trajectory-level policies, enabling exploration,
recovery behaviors, and principled credit assignment, and it reduces dependence
on such human priors and rater biases.
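As a point of reference (a minimal sketch assuming a standard policy-gradient
formulation, not any specific algorithm surveyed here), the planner policy
\pi_\theta generates a trajectory \tau = (s_0, a_0, \ldots, s_T, a_T) by
interleaving reasoning steps and tool calls (search, browsing, code) with
environment observations and receives a trajectory-level reward R(\tau); RL
then optimizes

    J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big],
    \qquad
    \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \Big],

where \hat{A}_t is an advantage estimate (for example, a baseline-subtracted or
group-relative return) that distributes credit for the final outcome across
individual reasoning and tool-use steps, in contrast to the per-decision
supervision of SFT and the pairwise comparisons of DPO.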
This survey is, to our knowledge, the first dedicated to the RL foundations
of deep research systems. It systematizes work after DeepSeek-R1 along three
axes: (i) data synthesis and curation; (ii) RL methods for agentic research
covering stability, sample efficiency, long-context handling, reward and credit
design, multi-objective optimization, and multimodal integration; and (iii)
agentic RL training systems and frameworks. We also cover agent architecture
and coordination, as well as evaluation and benchmarks, including recent QA,
VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We
distill recurring patterns, surface infrastructure bottlenecks, and offer
practical guidance for training robust, transparent deep research agents with
RL.