CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning
September 25, 2025
Authors: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
cs.AI
Abstract
Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces the gradients of tokens clipped by native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
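
The abstract describes the core mechanism only in words: vanilla PPO zeroes the gradient of any token whose importance ratio lands outside the clipping interval, while CE-GPPO lets a bounded gradient flow back for those tokens. The sketch below is a minimal PyTorch illustration of that general idea, not the authors' exact objective: it keeps the forward value of the standard clipped surrogate and re-attaches, via a stop-gradient trick, a scaled policy-gradient term for clipped tokens. The names `gradient_preserving_clip_loss`, `beta_low`, and `beta_high` are hypothetical; the paper's actual hyperparameters and loss form may differ.

```python
# Minimal sketch (assumed PyTorch API) of a gradient-preserving clipped surrogate.
# NOT the paper's exact objective; beta_low / beta_high are hypothetical knobs for
# the per-side gradient magnitude mentioned only qualitatively in the abstract.
import torch


def gradient_preserving_clip_loss(
    logp_new: torch.Tensor,    # log pi_theta(a_t | s_t) under the current policy, shape [T]
    logp_old: torch.Tensor,    # log-probs under the behavior (old) policy, shape [T]
    advantages: torch.Tensor,  # per-token advantage estimates, shape [T]
    eps_low: float = 0.2,
    eps_high: float = 0.2,
    beta_low: float = 0.1,     # hypothetical scale for gradients re-attached on the low side
    beta_high: float = 0.1,    # hypothetical scale for gradients re-attached on the high side
) -> torch.Tensor:
    ratio = torch.exp(logp_new - logp_old)

    # Vanilla PPO clipped surrogate; its gradient vanishes wherever the clipped
    # branch is the active minimum.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    ppo_term = torch.min(unclipped, clipped)

    # Tokens whose gradient vanilla PPO discards: positive advantage with the ratio
    # above the upper bound, or negative advantage with the ratio below the lower bound.
    above = (ratio > 1.0 + eps_high) & (advantages > 0)
    below = (ratio < 1.0 - eps_low) & (advantages < 0)

    # Per-token gradient scale: zero for unclipped tokens, beta_* for clipped ones.
    beta = torch.zeros_like(ratio)
    beta = torch.where(above, torch.full_like(ratio, beta_high), beta)
    beta = torch.where(below, torch.full_like(ratio, beta_low), beta)

    # Clip boundary used to bound the magnitude of the re-attached gradient.
    bound = torch.where(
        above,
        torch.full_like(ratio, 1.0 + eps_high),
        torch.full_like(ratio, 1.0 - eps_low),
    )

    # Stop-gradient trick: the forward value equals the PPO term, but a bounded
    # policy-gradient contribution beta * bound * A * d(log pi) flows back for
    # clipped tokens instead of being discarded.
    preserved = ppo_term.detach() + beta * bound * advantages * (
        logp_new - logp_new.detach()
    )
    surrogate = torch.where(above | below, preserved, ppo_term)

    # Maximize the surrogate, i.e. minimize its negative mean.
    return -surrogate.mean()
```

In this sketch, the loss value is identical to standard PPO; only the backward pass changes, and the two coefficients act as the knob the abstract alludes to for trading off exploration against exploitation by scaling how strongly out-of-interval tokens influence the update.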