
Reinforcement Learning Foundations for Deep Research Systems: A Survey

September 8, 2025
Authors: Wenjun Li, Zhi Chen, Jingru Lin, Hannan Cao, Wei Han, Sheng Liang, Zhi Zhang, Kuicai Dong, Dexun Li, Chen Zhang, Yong Liu
cs.AI

Abstract

Deep research systems, agentic AI systems that solve complex, multi-step tasks by coordinating reasoning, search across the open web and user files, and tool use, are moving toward hierarchical deployments with a Planner, Coordinator, and Executors. In practice, training entire stacks end-to-end remains impractical, so most work trains a single planner connected to core tools such as search, browsing, and code. While supervised fine-tuning (SFT) imparts protocol fidelity, it suffers from imitation and exposure biases and underuses environment feedback. Preference alignment methods such as Direct Preference Optimization (DPO) are schema- and proxy-dependent, off-policy, and weak for long-horizon credit assignment and multi-objective trade-offs. A further limitation of SFT and DPO is their reliance on human-defined decision points and subskills through schema design and labeled comparisons. Reinforcement learning aligns with closed-loop, tool-interaction research by optimizing trajectory-level policies, enabling exploration, recovery behaviors, and principled credit assignment, and it reduces dependence on such human priors and rater biases. This survey is, to our knowledge, the first dedicated to the RL foundations of deep research systems. It systematizes work after DeepSeek-R1 along three axes: (i) data synthesis and curation; (ii) RL methods for agentic research covering stability, sample efficiency, long-context handling, reward and credit design, multi-objective optimization, and multimodal integration; and (iii) agentic RL training systems and frameworks. We also cover agent architecture and coordination, as well as evaluation and benchmarks, including recent QA, VQA, long-form synthesis, and domain-grounded, tool-interaction tasks. We distill recurring patterns, surface infrastructure bottlenecks, and offer practical guidance for training robust, transparent deep research agents with RL.
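To make the contrast with SFT and DPO concrete, the following is a minimal, self-contained sketch (not taken from the survey) of what trajectory-level optimization can look like for a tool-using planner: a closed-loop rollout over search/browse/code actions, a single outcome reward per trajectory, and a group-relative baseline in the spirit of GRPO. All function names (`run_policy_step`, `call_tool`, `judge_answer`) and the toy policy/environment are hypothetical placeholders; a real system would sample actions from an LLM and score answers with a verifier or judge.

```python
# Minimal sketch of trajectory-level RL for a tool-using research agent.
# The policy, tools, and reward here are toy placeholders, not APIs from
# any surveyed system.
import math
import random

TOOLS = ("search", "browse", "code", "answer")

def run_policy_step(question, history):
    """Placeholder planner policy: pick a tool action and its log-prob.
    A real system would sample this from an LLM conditioned on history."""
    action = random.choice(TOOLS)
    logprob = math.log(1.0 / len(TOOLS))
    return action, logprob

def call_tool(action, question):
    """Placeholder environment: return a synthetic tool observation."""
    return f"[{action} result for: {question}]"

def judge_answer(question, history):
    """Placeholder outcome reward: 1.0 if the rollout ever answered."""
    return 1.0 if any(a == "answer" for a, _, _ in history) else 0.0

def rollout(question, max_steps=4):
    """Closed-loop rollout: alternate policy actions and tool feedback,
    then assign one trajectory-level reward at the end."""
    history, logprobs = [], []
    for _ in range(max_steps):
        action, logprob = run_policy_step(question, history)
        observation = call_tool(action, question)
        history.append((action, observation, logprob))
        logprobs.append(logprob)
        if action == "answer":
            break
    return judge_answer(question, history), logprobs

def group_relative_advantages(rewards):
    """GRPO-style baseline: standardize rewards across rollouts of the
    same question, so credit is assigned relative to peer trajectories."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    return [(r - mean) / std for r in rewards]

if __name__ == "__main__":
    question = "Who proposed the transformer architecture?"
    group = [rollout(question) for _ in range(8)]
    advantages = group_relative_advantages([reward for reward, _ in group])
    # REINFORCE-style surrogate: every step in a trajectory shares that
    # trajectory's advantage (the simplest form of credit assignment).
    loss = -sum(adv * sum(lps) for (_, lps), adv in zip(group, advantages))
    print(f"surrogate loss over {len(group)} rollouts: {loss:.3f}")
```

The point of the sketch is the unit of optimization: the whole tool-interaction trajectory is rolled out and scored, rather than individual labeled steps or pairwise preferences, which is what the abstract means by exploration, recovery behaviors, and principled credit assignment without hand-specified decision points.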