CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
May 18, 2025
作者: Zongkai Liu, Fanqing Meng, Lingxiao Du, Zhixiang Zhou, Chao Yu, Wenqi Shao, Qiaosheng Zhang
cs.AI
Abstract
Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clip mechanism on the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
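To make the two mechanisms named in the abstract concrete, below is a minimal PyTorch-style sketch of a CPGD-style objective. It is an illustration only, not the paper's implementation (see the MM-EUREKA repository for that): the function name, the clipping threshold `clip_eps`, the drift coefficient `drift_coef`, and the specific KL estimator used for the drift term are assumptions made for this example.

```python
import torch

def cpgd_style_loss(logp_new, logp_old, advantages,
                    clip_eps=0.2, drift_coef=0.04):
    """Illustrative CPGD-style objective (not the paper's exact loss).

    Combines (i) a clip applied to the log-probability ratio, as described
    in the abstract, and (ii) a KL-based policy-drift penalty that
    regularizes how far the updated policy moves from the old policy.
    All hyperparameter values here are placeholders, not the paper's.
    """
    log_ratio = logp_new - logp_old                          # log(pi_theta / pi_old)
    clipped_log_ratio = torch.clamp(log_ratio, -clip_eps, clip_eps)
    # Policy-gradient term built from the clipped log-ratio.
    pg_loss = -(torch.exp(clipped_log_ratio) * advantages).mean()
    # Policy drift: non-negative per-sample estimate of KL(pi_old || pi_theta),
    # valid when the tokens were sampled from pi_old.
    drift = (torch.exp(log_ratio) - log_ratio - 1.0).mean()
    return pg_loss + drift_coef * drift
```

In this sketch, clamping the log-ratio to [-clip_eps, clip_eps] bounds the ratio symmetrically in log space, while the drift term penalizes the estimated KL divergence between the old and new policies, which is the general behavior the abstract attributes to CPGD.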