CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

May 18, 2025
Authors: Zongkai Liu, Fanqing Meng, Lingxiao Du, Zhixiang Zhou, Chao Yu, Wenqi Shao, Qiaosheng Zhang
cs.AI

Abstract

Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
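As a rough illustration of the two mechanisms named in the abstract (clipping the logarithm of the probability ratio, and a KL-based policy-drift penalty), below is a minimal sketch of one possible surrogate loss. The function name `cpgd_style_loss`, the clip range `epsilon`, the drift weight `beta`, and the specific KL estimator are assumptions for illustration, not the authors' implementation; the released code contains the actual algorithm.

```python
import torch


def cpgd_style_loss(logp_new, logp_old, advantages, epsilon=0.2, beta=0.1):
    """Sketch of a CPGD-style surrogate loss (illustrative, not the official code).

    Args:
        logp_new:   log-probabilities of sampled tokens under the current policy.
        logp_old:   log-probabilities under the old (behavior) policy.
        advantages: advantage estimates derived from the rule-based rewards.
        epsilon:    clip range applied to the log of the probability ratio (assumed).
        beta:       weight of the KL policy-drift penalty (assumed).
    """
    # Clip the *log* of the importance ratio rather than the ratio itself.
    log_ratio = logp_new - logp_old.detach()
    clipped_log_ratio = torch.clamp(log_ratio, -epsilon, epsilon)

    # Pessimistic (PPO-style) surrogate built on the clipped log ratio.
    surrogate = torch.min(advantages * log_ratio, advantages * clipped_log_ratio)

    # Per-sample estimator of KL(pi_old || pi_new) on samples drawn from pi_old,
    # used here as the policy-drift penalty.
    drift = torch.exp(log_ratio) - 1.0 - log_ratio

    # Maximize the surrogate while penalizing drift; return a loss to minimize.
    return -(surrogate - beta * drift).mean()
```

Note that clipping the log ratio to [-epsilon, epsilon] corresponds to clipping the ratio itself to [exp(-epsilon), exp(epsilon)], so the bound is symmetric in log space; the exact form of the CPGD objective and its hyperparameters are given in the paper and repository.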
