Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
December 1, 2025
Authors: Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, Junyang Lin
cs.AI
Abstract
This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
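For orientation, the following is a minimal sketch of the kind of clipped, importance-sampling-corrected token-level surrogate the abstract refers to, written in the standard PPO/GRPO-style form; the symbols $r_t(\theta)$, $\hat{A}$, and $\epsilon$ are our own notation, and the paper's exact objective may differ:
\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{(x,\,y)\,\sim\,\pi_{\theta_{\mathrm{old}}}}\!\left[ \frac{1}{|y|} \sum_{t=1}^{|y|} \min\!\Big( r_t(\theta)\,\hat{A},\;\; \operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A} \Big) \right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(y_t \mid x,\, y_{<t})}{\pi_{\theta_{\mathrm{old}}}(y_t \mid x,\, y_{<t})}.
\]
Here $\hat{A}$ is a sequence-level advantage derived from the true reward, and $r_t(\theta)$ is the per-token importance ratio between the current policy and the (possibly stale) policy that generated the rollout; the ratio supplies the importance sampling correction, while the clip term bounds updates when the two policies diverge.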