Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
February 10, 2026
Authors: Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continued training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper reshapes entropy control in RL from the perspective of Gradient-Preserving Clipping. We first verify, theoretically and empirically, the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism that uses dynamic clipping thresholds to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
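To make the idea of a dynamic clipping threshold concrete, the minimal sketch below shows a standard PPO-style clipped surrogate loss whose clipping threshold follows a hypothetical oscillatory-decay schedule over training steps. This is an illustrative baseline only, assuming a generic clipped-ratio objective; the function names, the schedule parameters (`eps_base`, `amp`, `period`), and the schedule shape are assumptions, and the snippet does not reproduce the paper's gradient-preserving clipping variant or its exact entropy-control rules.

```python
import math
import torch


def dynamic_clip_eps(step, total_steps, eps_base=0.2, amp=0.1, period=200):
    """Hypothetical oscillatory-decay schedule: the clipping threshold
    oscillates around a base value while the oscillation amplitude
    decays linearly as training progresses."""
    decay = 1.0 - step / total_steps  # linear decay envelope in [0, 1]
    return eps_base + amp * decay * math.cos(2.0 * math.pi * step / period)


def clipped_policy_loss(logp_new, logp_old, advantages, eps):
    """Standard PPO-style clipped surrogate loss with a (possibly
    time-varying) clipping threshold eps. Not the paper's
    gradient-preserving formulation; shown only to illustrate where a
    dynamic threshold enters the objective."""
    ratio = torch.exp(logp_new - logp_old)  # importance sampling ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.minimum(unclipped, clipped).mean()


# Toy usage with random data, just to show the call pattern.
logp_old = torch.randn(8)
logp_new = logp_old + 0.05 * torch.randn(8)
adv = torch.randn(8)
eps_t = dynamic_clip_eps(step=500, total_steps=2000)
loss = clipped_policy_loss(logp_new, logp_old, adv, eps_t)
```

Under this kind of schedule, widening or narrowing the clip range changes how often tokens with large or small importance ratios contribute gradient, which is the lever the abstract describes for steering entropy upward or downward during different phases of training.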