CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models

Abstract

Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capability of language models (LMs) through rule-based rewards. However, existing RL methods such as GRPO, REINFORCE++, and RLOO often suffer from training instability: large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and applies a clip mechanism to the logarithm of the ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
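The two mechanisms named in the abstract, a clip on the logarithm of the policy ratio and a KL-based policy drift penalty, can be illustrated with a minimal sketch. This is not the released implementation: the function name `cpgd_style_loss`, the defaults `eps` and `alpha`, the PPO-style pessimistic minimum, and the particular sample-based KL estimator are all assumptions made for illustration.

```python
import math

def cpgd_style_loss(logp_new, logp_old, advantages, eps=0.2, alpha=0.1):
    """Hypothetical per-token loss: clipped log-ratio surrogate plus drift penalty.

    logp_new / logp_old: log-probabilities of the sampled tokens under the
    current and the old (sampling) policy. All names here are illustrative.
    """
    # Clip bounds live in log space, mirroring a ratio range of [1 - eps, 1 + eps].
    lo, hi = math.log(1.0 - eps), math.log(1.0 + eps)
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        log_ratio = lp_new - lp_old
        clipped = min(max(log_ratio, lo), hi)
        # Assumption: take the pessimistic minimum of the raw and clipped terms.
        surrogate = min(log_ratio * adv, clipped * adv)
        # Assumption: non-negative sample estimate of KL(old || new),
        # r - 1 - log r with r = pi_new / pi_old, on samples from the old policy.
        drift = math.exp(log_ratio) - 1.0 - log_ratio
        # Maximize the surrogate, penalize drift, so the loss negates the former.
        total += -surrogate + alpha * drift
    return total / len(logp_new)
```

In this sketch, a positive-advantage token whose log-ratio already exceeds `log(1 + eps)` contributes a constant surrogate term, so only the drift penalty still produces a gradient for it, and that gradient pulls the policy back toward the old one.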
