CE-GPPO: Controlling Entropy via Gradient-Preserving Clipping Policy Optimization in Reinforcement Learning

September 25, 2025
Authors: Zhenpeng Su, Leiyu Pan, Minxuan Lv, Yuntao Li, Wenping Hu, Fuzheng Zhang, Kun Gai, Guorui Zhou
cs.AI

Abstract

Reinforcement learning (RL) has become a powerful paradigm for optimizing large language models (LLMs) to handle complex reasoning tasks. A core challenge in this process lies in managing policy entropy, which reflects the balance between exploration and exploitation during training. Existing methods, such as proximal policy optimization (PPO) and its variants, discard valuable gradient signals from low-probability tokens due to the clipping mechanism. We systematically analyze the entropy dynamics and reveal that these clipped tokens play a critical yet overlooked role in regulating entropy evolution. We propose Controlling Entropy via Gradient-Preserving Policy Optimization (CE-GPPO), a novel algorithm that reintroduces gradients from clipped tokens in native PPO in a gentle and bounded manner. By controlling the magnitude of gradients from tokens outside the clipping interval, CE-GPPO is able to achieve an exploration-exploitation trade-off. We provide theoretical justification and empirical evidence showing that CE-GPPO effectively mitigates entropy instability. Extensive experiments on mathematical reasoning benchmarks show that CE-GPPO consistently outperforms strong baselines across different model scales.
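The abstract describes how PPO's clipping zeroes out gradients for tokens whose importance ratio falls outside [1 - eps, 1 + eps], and how CE-GPPO re-attaches those gradients in a gentle, bounded way. The paper's exact formulation is not reproduced here; the PyTorch sketch below only illustrates the general idea using a stop-gradient trick, with `beta` as an illustrative coefficient (a placeholder, not a parameter name from the paper) bounding the re-introduced gradients.

```python
import torch


def gradient_preserving_clip_loss(
    logp_new: torch.Tensor,    # token log-probs under the current policy, shape [T]
    logp_old: torch.Tensor,    # token log-probs under the rollout policy, shape [T]
    advantages: torch.Tensor,  # per-token advantage estimates, shape [T]
    eps: float = 0.2,          # PPO clipping half-width
    beta: float = 0.1,         # illustrative bound on re-introduced gradients
) -> torch.Tensor:
    """Sketch of a gradient-preserving clipped policy loss.

    Standard PPO replaces the importance ratio with a constant outside
    [1 - eps, 1 + eps], so clipped tokens contribute zero gradient. Here
    the forward value stays identical to PPO, but a small gradient path
    (scaled by beta) is re-attached for out-of-range tokens.
    """
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)

    # Forward value equals `clipped`; the backward pass sees
    # beta * d(ratio)/d(theta). detach() blocks the constant part.
    preserved = (clipped - beta * ratio).detach() + beta * ratio

    out_of_range = (ratio < 1.0 - eps) | (ratio > 1.0 + eps)
    surrogate = torch.where(out_of_range, preserved, clipped)

    # Pessimistic PPO objective with the modified clipped term.
    loss = -torch.min(ratio * advantages, surrogate * advantages)
    return loss.mean()
```

Setting `beta = 0` recovers standard PPO clipping exactly, since the re-attached gradient path vanishes; a fuller implementation might use separate coefficients for the lower and upper clipping boundaries to control exploration and exploitation pressure independently.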