Experiential Reinforcement Learning
February 15, 2026
Authors: Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, Jieyu Zhao
cs.AI
Abstract
Reinforcement learning has become the central approach for language models (LMs) to learn from environmental rewards or feedback. In practice, environmental feedback is often sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes in future iterations. We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.
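To make the experience-reflection-consolidation loop concrete, the sketch below shows one training step in Python, following only the description in the abstract. The `Policy` and `Environment` interfaces, and the method names `generate`, `evaluate`, and `reinforce`, are hypothetical placeholders, not the paper's actual API; the consolidation step is hedged as a generic policy-gradient-style update on the successful refined attempt.

```python
from typing import Protocol


class Policy(Protocol):
    """Hypothetical LM policy interface (not the paper's API)."""
    def generate(self, prompt: str) -> str: ...
    def reinforce(self, prompt: str, completion: str, reward: float) -> None: ...


class Environment(Protocol):
    """Hypothetical task environment returning scalar feedback."""
    def evaluate(self, task: str, attempt: str) -> float: ...


def erl_step(policy: Policy, env: Environment, task: str) -> float:
    """One sketched ERL step: experience -> reflection -> consolidation."""
    # Experience: initial attempt plus (possibly sparse, delayed) feedback.
    attempt_1 = policy.generate(task)
    feedback = env.evaluate(task, attempt_1)

    # Reflection: the model diagnoses the outcome and proposes a revision.
    reflection = policy.generate(
        f"Task: {task}\nAttempt: {attempt_1}\nFeedback: {feedback}\n"
        "Reflect on what should change before trying again."
    )

    # Refined second attempt, guided by the reflection.
    attempt_2 = policy.generate(f"Task: {task}\nReflection: {reflection}")
    reward = env.evaluate(task, attempt_2)

    # Consolidation: successful refined behavior is reinforced into the base
    # policy, so the improvement persists at deployment without requiring a
    # reflection pass at inference time.
    if reward > 0:
        policy.reinforce(task, attempt_2, reward)
    return reward
```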