Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning
December 5, 2025
作者: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin, Yuntao Li, Wenping Hu, Ruiming Tang, Kun Gai, Guorui Zhou
cs.AI
Abstract
Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region, leading to training instabilities manifested as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration throughout updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for PPO-Clip's inability to regulate probability shifts of unsampled actions. We integrate ERC into both the DAPO and GPPO reinforcement learning algorithms. Experiments across multiple benchmarks show that ERC consistently improves performance.
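To make the mechanism described above concrete, here is a minimal PyTorch sketch of an entropy-ratio constraint of the kind the abstract describes. The function name `entropy_ratio_clip_penalty`, the band `[eps_low, eps_high]`, and the quadratic out-of-band penalty are illustrative assumptions; the abstract only states that ERC applies a bidirectional constraint on the ratio between the current and previous policy entropies, and the paper's exact loss formulation is not given here.

```python
import torch
import torch.nn.functional as F


def entropy_ratio_clip_penalty(curr_logits: torch.Tensor,
                               prev_logits: torch.Tensor,
                               eps_low: float = 0.8,
                               eps_high: float = 1.2) -> torch.Tensor:
    """Illustrative sketch of an entropy-ratio clipping (ERC) style constraint.

    Computes the ratio between the mean token-level entropy of the current
    policy and that of the previous (behavior) policy, then penalizes the
    current policy whenever the ratio leaves the band [eps_low, eps_high].
    The band values and the penalty form are assumptions for illustration.
    """
    # Token-level entropies over the vocabulary dimension for both policies.
    curr_entropy = torch.distributions.Categorical(logits=curr_logits).entropy().mean()
    with torch.no_grad():  # previous policy is treated as fixed
        prev_entropy = torch.distributions.Categorical(logits=prev_logits).entropy().mean()

    # Global metric: relative change of policy entropy across the update.
    ratio = curr_entropy / (prev_entropy + 1e-8)

    # Bidirectional (two-sided) constraint: zero penalty inside the band,
    # quadratic penalty when the ratio drifts above or below it.
    below = F.relu(eps_low - ratio)
    above = F.relu(ratio - eps_high)
    return (below + above).pow(2)


# Hypothetical usage: add the penalty to an existing policy-gradient objective,
# e.g. loss = base_rl_loss + erc_coef * entropy_ratio_clip_penalty(curr_logits, prev_logits)
```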