Rethinking Divergence Regularization in Reinforcement Learning for Large Language Models

Abstract

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
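
The contrast between the three gating schemes described above can be illustrated with a small numerical sketch. It is not the paper's implementation: the abstract does not state the exact objectives, so the function names, the thresholds `eps=0.2` and `delta=0.05`, and in particular the quadratic form J = A*d - |A|*d^2/(2*delta) (with d the sampled token's absolute probability shift) are assumptions chosen only to reproduce the qualitative behavior the abstract describes: a weight that is continuous, bounded, zero at the same boundary as the hard mask, and negative (corrective) beyond it.

```python
# Illustrative sketch only. All names, thresholds, and the quadratic form
# are assumptions, not taken from the paper.

def signed_shift(p_old, p_new, adv):
    """Probability shift measured in the direction the advantage pushes."""
    shift = p_new - p_old
    return shift if adv >= 0 else -shift

def ratio_clip_weight(p_old, p_new, adv, eps=0.2):
    """PPO-style gate: 0.0 when the clipped branch is active, else 1.0."""
    ratio = p_new / p_old
    if adv > 0 and ratio > 1 + eps:
        return 0.0
    if adv < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0

def hard_mask_weight(p_old, p_new, adv, delta=0.05):
    """Assumed divergence-style hard mask on absolute probability shift."""
    return 0.0 if signed_shift(p_old, p_new, adv) > delta else 1.0

def smooth_weight(p_old, p_new, adv, delta=0.05):
    """Gradient weight of the assumed surrogate J = A*d - |A|*d**2/(2*delta).

    dJ/dd = A * (1 - sign(A)*d/delta), so the weight multiplying A is
    1 inside at d = 0, 0 on the boundary, and negative beyond it.
    Because |d| <= 1, the weight stays within [1 - 1/delta, 1 + 1/delta].
    """
    return 1.0 - signed_shift(p_old, p_new, adv) / delta

if __name__ == "__main__":
    cases = {
        "rare token   (1e-4 -> 3e-4)": (1e-4, 3e-4, 1.0),
        "common token (0.50 -> 0.58)": (0.50, 0.58, 1.0),
    }
    for name, (p_old, p_new, adv) in cases.items():
        print(name,
              "| ratio clip:", ratio_clip_weight(p_old, p_new, adv),
              "| hard mask:", hard_mask_weight(p_old, p_new, adv),
              "| smooth:", round(smooth_weight(p_old, p_new, adv), 3))
```

Under these assumed thresholds, the rare token triples its probability and is clipped by the ratio rule although its absolute shift is only 2e-4, whereas the common token moves by 0.08 in absolute probability yet passes the ratio rule with a ratio of 1.16. The hard mask drops the common token's gradient entirely, while the smooth weight turns negative there and so pushes the token back toward the boundary.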