

Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning

December 5, 2025
作者: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Tiehua Mei, Zijia Lin, Yuntao Li, Wenping Hu, Ruiming Tang, Kun Gai, Guorui Zhou
cs.AI

Abstract

Large language model post-training relies on reinforcement learning to improve model capability and alignment quality. However, the off-policy training paradigm introduces distribution shift, which often pushes the policy beyond the trust region and leads to training instabilities that manifest as fluctuations in policy entropy and unstable gradients. Although PPO-Clip mitigates this issue through importance-ratio clipping, it still overlooks the global distributional shift of actions. To address these challenges, we propose using the entropy ratio between the current and previous policies as a new global metric that effectively quantifies the relative change in policy exploration across updates. Building on this metric, we introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-Clip to regulate probability shifts of unsampled actions. We integrate ERC into both the DAPO and GPPO reinforcement learning algorithms, and experiments across multiple benchmarks show that ERC consistently improves performance.
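
The abstract does not specify how the bidirectional constraint on the entropy ratio is enforced; the sketch below is one minimal interpretation, in which the entropy of the current policy is compared against the (detached) entropy of the previous policy and the ratio is softly kept inside a two-sided band via a penalty. The function names, the quadratic penalty form, and the thresholds `eps_low` / `eps_high` are illustrative assumptions, not the authors' implementation.

```python
import torch


def policy_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Mean entropy of the categorical policy defined by `logits`."""
    log_probs = torch.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1).mean()


def entropy_ratio_penalty(
    curr_logits: torch.Tensor,
    prev_logits: torch.Tensor,
    eps_low: float = 0.2,   # assumed lower margin of the band
    eps_high: float = 0.2,  # assumed upper margin of the band
) -> torch.Tensor:
    """Soft penalty when H(pi_new) / H(pi_old) leaves [1 - eps_low, 1 + eps_high].

    `prev_logits` come from the policy before the update and are detached,
    so only the current policy receives gradients from this term.
    """
    h_curr = policy_entropy(curr_logits)
    h_prev = policy_entropy(prev_logits).detach()
    ratio = h_curr / (h_prev + 1e-8)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Zero while the ratio stays inside the band; quadratic outside it.
    return (ratio - clipped).pow(2)
```

In this reading, the penalty would be added to the usual clipped surrogate objective (e.g. in DAPO or GPPO) with some weighting coefficient, acting as a soft global constraint on the policy's entropy rather than a per-token importance clip.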