Flexible Entropy Control in RLVR with Gradient-Preserving Perspective
February 10, 2026
Authors: Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper reframes entropy control in RL from the perspective of Gradient-Preserving Clipping. We first verify, both theoretically and empirically, how specific importance sampling ratio regions contribute to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism that uses a dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks.
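To make the mechanism sketched in the abstract concrete, the following is a minimal PyTorch illustration, not the authors' released implementation: a clipped policy-gradient surrogate whose clipping threshold is widened or narrowed according to the gap between the current policy entropy and a scheduled target, with a straight-through variant so that clipped-ratio tokens still contribute gradients. The schedule shapes, the `threshold_gain` parameter, the clipping bounds, and the straight-through formulation are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch (assumptions, not the paper's implementation) of entropy-aware
# dynamic clipping: the threshold eps is adjusted from the gap between current
# entropy and a scheduled target, and clipping is applied in a gradient-preserving
# (straight-through) fashion so clipped-ratio tokens still pass gradients.
import math
import torch


def entropy_target(step: int, total_steps: int, mode: str = "increase_then_decrease") -> float:
    """Illustrative entropy schedules; the exact functional forms are assumed."""
    t = step / max(total_steps, 1)
    if mode == "increase_then_decrease":
        return 0.2 + 0.4 * math.sin(math.pi * t)                # rise, then fall back
    if mode == "oscillatory_decay":
        return 0.6 * (1.0 - t) * (0.5 + 0.5 * math.cos(6 * math.pi * t))
    return 0.6 * (1.0 - t)                                      # plain linear decay


def dynamic_clip_loss(logp_new: torch.Tensor,
                      logp_old: torch.Tensor,
                      advantages: torch.Tensor,
                      entropy: float,
                      entropy_goal: float,
                      base_eps: float = 0.2,
                      threshold_gain: float = 0.5) -> torch.Tensor:
    """PPO-style surrogate with an entropy-driven clipping threshold (hypothetical form)."""
    # Widen the clip range when entropy is below its target (admitting more
    # exploratory high-ratio tokens); narrow it when entropy is above target.
    eps = base_eps * (1.0 + threshold_gain * (entropy_goal - entropy))
    eps = float(min(max(eps, 0.05), 0.5))                       # assumed safety bounds

    ratio = torch.exp(logp_new - logp_old)                      # importance sampling ratio
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Straight-through "gradient-preserving" clipping: the forward pass uses the
    # clipped value, while the backward pass keeps the unclipped gradient of `ratio`.
    clipped_st = ratio + (clipped - ratio).detach()
    return -torch.min(ratio * advantages, clipped_st * advantages).mean()


# Example usage inside a training loop (shapes and values are illustrative):
if __name__ == "__main__":
    torch.manual_seed(0)
    logp_old = torch.randn(8)
    logp_new = logp_old + 0.1 * torch.randn(8)
    adv = torch.randn(8)
    goal = entropy_target(step=100, total_steps=1000, mode="oscillatory_decay")
    loss = dynamic_clip_loss(logp_new, logp_old, adv, entropy=0.35, entropy_goal=goal)
    print(float(loss))
```

The sketch only illustrates the regulation idea described above (threshold as a function of an entropy gap, plus scheduled targets); how the paper partitions importance sampling ratio regions and preserves their gradients exactly is not reproduced here.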