

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

December 1, 2025
作者: Chujie Zheng, Kai Dang, Bowen Yu, Mingze Li, Huiqiang Jiang, Junrong Lin, Yuqiong Liu, An Yang, Jingren Zhou, Junyang Lin
cs.AI

Abstract

This paper proposes a novel formulation for reinforcement learning (RL) with large language models, explaining why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. Specifically, through a first-order approximation, we show that this surrogate becomes increasingly valid only when both the training-inference discrepancy and policy staleness are minimized. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training, including importance sampling correction, clipping, and particularly Routing Replay for Mixture-of-Experts (MoE) models. Through extensive experiments with a 30B MoE model totaling hundreds of thousands of GPU hours, we show that for on-policy training, the basic policy gradient algorithm with importance sampling correction achieves the highest training stability. When off-policy updates are introduced to accelerate convergence, combining clipping and Routing Replay becomes essential to mitigate the instability caused by policy staleness. Notably, once training is stabilized, prolonged optimization consistently yields comparable final performance regardless of cold-start initialization. We hope that the shared insights and the developed recipes for stable RL training will facilitate future research.
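To make the abstract's key ingredients concrete, below is a minimal sketch of a token-level surrogate objective with importance sampling correction and clipping, in the style of a PPO-like clipped loss. This is an illustrative reconstruction, not the paper's exact recipe: variable names, the clipping threshold, and the advantage broadcasting are assumptions, and the MoE-specific Routing Replay mechanism is not shown.

```python
import torch

def clipped_token_surrogate(logp_new, logp_old, advantages, mask, clip_eps=0.2):
    """Illustrative token-level surrogate loss (PPO-style), assumed form.

    logp_new:   log-probs of sampled tokens under the current policy   [B, T]
    logp_old:   log-probs under the (possibly stale) behavior policy   [B, T]
    advantages: sequence-level reward/advantage broadcast to tokens    [B, T]
    mask:       1 for response tokens, 0 for padding                   [B, T]
    """
    # Per-token importance ratio corrects for the training-inference
    # discrepancy and for policy staleness between sampling and updating.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The pessimistic (min) objective limits the update when the ratio drifts,
    # i.e. when the first-order approximation of the sequence-level reward
    # stops being valid.
    per_token = torch.minimum(unclipped, clipped)
    return -(per_token * mask).sum() / mask.sum()
```

In a fully on-policy setting (behavior policy equal to the current policy up to inference-engine numerics), the ratio stays near one and the clipping rarely activates, which matches the abstract's claim that importance sampling correction alone suffices for stability; clipping becomes important once off-policy updates introduce staleness.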