Rethinking Divergence Regularization in Reinforcement Learning for Large Language Models

Abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
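The claim that the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies can be illustrated with a small numerical sketch. It is not the DPPO or DRPO implementation: the function names, the thresholds `eps` and `delta`, and the two example tokens are all hypothetical, and the sign of the advantage is ignored for simplicity. It only contrasts a ratio-based test with a test on the sampled token's absolute probability shift.

```python
def ratio_outside(p_old: float, p_new: float, eps: float = 0.2) -> bool:
    """Ratio-based test: is p_new / p_old outside [1 - eps, 1 + eps]?

    eps = 0.2 is an illustrative value, not one taken from the paper.
    """
    ratio = p_new / p_old
    return ratio < 1.0 - eps or ratio > 1.0 + eps


def shift_outside(p_old: float, p_new: float, delta: float = 0.03) -> bool:
    """Divergence-style test: does the sampled token's absolute
    probability shift |p_new - p_old| exceed delta?

    delta = 0.03 is an illustrative value, not one taken from the paper.
    """
    return abs(p_new - p_old) > delta


# Hypothetical (p_old, p_new) pairs for one sampled token each.
rare = (1e-4, 3e-4)    # tail token: ratio is about 3, absolute shift is 2e-4
common = (0.50, 0.55)  # head token: ratio is about 1.1, absolute shift is about 0.05

for name, (p_old, p_new) in [("rare", rare), ("common", common)]:
    print(
        f"{name}: ratio={p_new / p_old:.2f}, "
        f"shift={abs(p_new - p_old):.4f}, "
        f"ratio_outside={ratio_outside(p_old, p_new)}, "
        f"shift_outside={shift_outside(p_old, p_new)}"
    )
```

Under these assumed thresholds, the ratio test flags the tail token even though almost no probability mass has moved, and passes the head token even though it has moved far more mass. The shift-based test gives the opposite verdict on both tokens.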