CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

Abstract

Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) trained with rule-based rewards. However, existing RL methods such as GRPO, REINFORCE++, and RLOO often suffer from training instability: large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and applies a clip mechanism to the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
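
To make the two ingredients named in the abstract concrete, the following is a minimal sketch of a per-token objective that clips the log-ratio and subtracts a KL-based drift penalty. It is not taken from the released code: the function name, the arguments `eps` and `drift_coef`, their default values, the PPO-style pessimistic `min`, and the particular KL estimator are all illustrative assumptions, and the exact objective used by CPGD may differ.

```python
import math


def clipped_log_ratio_objective(logp_new, logp_old, advantages,
                                eps=0.2, drift_coef=0.1):
    """Mean per-token objective (to be maximized).

    Hypothetical helper: all names and default values are assumptions
    made for illustration, not the paper's implementation.
    """
    log_lo, log_hi = math.log(1.0 - eps), math.log(1.0 + eps)
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        log_ratio = lp_new - lp_old
        # Clip the log-ratio itself rather than the probability ratio.
        clipped = min(max(log_ratio, log_lo), log_hi)
        # Pessimistic choice between the unclipped and clipped terms
        # (assumed here by analogy with PPO-style clipping).
        pg_term = min(log_ratio * adv, clipped * adv)
        # Nonnegative sample estimate of KL(old || new) for a token drawn
        # from the old policy: r - 1 - log r, with r = pi_new / pi_old.
        drift = math.exp(log_ratio) - 1.0 - log_ratio
        total += pg_term - drift_coef * drift
    return total / len(logp_new)


if __name__ == "__main__":
    # Toy log-probabilities and advantages, invented for this example.
    value = clipped_log_ratio_objective(
        logp_new=[-0.9, -2.0, -0.1],
        logp_old=[-1.0, -1.5, -1.2],
        advantages=[1.0, -0.5, 2.0],
    )
    print(f"objective = {value:.4f}")
```

In this sketch, a token whose log-ratio already lies outside the interval [log(1 - eps), log(1 + eps)] in the direction favored by its advantage contributes only the clipped value, while the drift term grows with any deviation from the old policy regardless of the advantage sign.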
