CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

Abstract

Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process is managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens because of the clipping mechanism. We systematically analyze the entropy dynamics and show that these clipped tokens play a critical yet overlooked role in regulating how entropy evolves. We propose Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from the tokens clipped in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO achieves a balance between exploration and exploitation. We provide theoretical justification and empirical evidence that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
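The clipping behavior described in the abstract can be illustrated with a small sketch. The first function is the gradient of the standard PPO clipped surrogate with respect to the probability ratio, which is zero for tokens pushed outside the clipping interval. The second function, `scaled_clip_grad`, and its coefficient `beta` are hypothetical names introduced here only to show what a bounded reintroduction of those gradients might look like; this is an assumption for illustration, not the formulation used in CE-GPPO.

```python
def is_clipped(ratio, advantage, eps=0.2):
    """True when the PPO min/clip surrogate selects the constant (clipped) branch."""
    clipped_high = advantage > 0 and ratio > 1 + eps
    clipped_low = advantage < 0 and ratio < 1 - eps
    return clipped_high or clipped_low


def ppo_clip_grad(ratio, advantage, eps=0.2):
    """Gradient of min(r * A, clip(r, 1 - eps, 1 + eps) * A) with respect to r."""
    if is_clipped(ratio, advantage, eps):
        return 0.0  # the clipped branch is constant in r, so the signal is discarded
    return advantage


def scaled_clip_grad(ratio, advantage, eps=0.2, beta=0.5):
    """Illustrative assumption: clipped tokens keep a gradient scaled by beta.

    With 0 <= beta <= 1 the reintroduced gradient never exceeds the unclipped one.
    """
    if is_clipped(ratio, advantage, eps):
        return beta * advantage
    return advantage


if __name__ == "__main__":
    for r, a in [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0), (0.5, 1.0)]:
        print(r, a, ppo_clip_grad(r, a), scaled_clip_grad(r, a))
```

In this sketch, setting `beta` to 0 recovers the standard PPO gradient, while larger values let clipped tokens contribute more to the update.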
