The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
May 28, 2025
Authors: Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding
cs.AI
Abstract
This paper aims to overcome a major obstacle to scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a large number of RL runs without entropy intervention: policy entropy drops sharply at the early training stage, and the diminished exploratory ability is always accompanied by saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is traded for policy entropy and is thus bottlenecked by its exhaustion, with a fully predictable ceiling of R = -a + b at H = 0. Our finding necessitates entropy management for continuous exploration when scaling compute for RL. To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between an action's probability and the change in its logit, which is proportional to the action's advantage when using policy-gradient-like algorithms. Empirical study shows that the value of the covariance term matches the entropy difference, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy decreases monotonically. By understanding the mechanism behind entropy dynamics, we are motivated to control entropy by restricting the update of high-covariance tokens. Specifically, we propose two simple yet effective techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance. Experiments show that these methods encourage exploration, helping the policy escape entropy collapse and achieve better downstream performance.
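For reference, the empirical law stated in the abstract can be written out explicitly; the coefficients a and b are fitted per model and task and are not specified here.

```latex
% Empirical entropy-performance law; a and b are fitted coefficients
R = -a\, e^{H} + b
% Performance ceiling predicted once policy entropy is exhausted (H = 0)
R\big|_{H = 0} = -a + b
```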
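The covariance quantity described in the abstract (between an action's log-probability and its advantage, which stands in for the logit change under policy-gradient-style updates) can be estimated per batch as in the sketch below. This is a minimal PyTorch illustration, not the authors' code; the function name and the flat tensor shapes are assumptions.

```python
import torch

def entropy_drift_covariance(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Batch estimate of Cov(log pi(a|s), A(a|s)) over sampled tokens.

    Per the abstract, the change in policy entropy is driven by the covariance
    between action probability and the change in logits, and the logit change
    is proportional to the advantage for policy-gradient-like algorithms.
    A mostly positive value (confident tokens also receiving positive
    advantages) therefore corresponds to entropy being pushed down.
    Both inputs are flat 1-D tensors over the sampled tokens of a batch.
    """
    lp = logprobs - logprobs.mean()
    adv = advantages - advantages.mean()
    return (lp * adv).mean()
```

Tracking this scalar over training steps would indicate whether entropy is still being consumed by the updates.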
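Below is a hedged sketch of how token-level Clip-Cov and KL-Cov could look, assuming the per-token covariance contribution is (log pi_i - mean)(A_i - mean). The top-k selection rule, the `clip_ratio`/`kl_coef` parameters, and the log-ratio KL estimate are illustrative placeholders rather than the paper's exact recipe.

```python
import torch

def clip_cov_mask(logprobs, advantages, clip_ratio=2e-3):
    """Clip-Cov (sketch): exclude a small set of high-covariance tokens from the loss."""
    # Per-token covariance contribution; detached so selection carries no gradient.
    cov = ((logprobs - logprobs.mean()) * (advantages - advantages.mean())).detach()
    k = max(1, int(clip_ratio * cov.numel()))
    _, idx = torch.topk(cov, k)            # tokens with the highest covariance
    mask = torch.ones_like(cov)
    mask[idx] = 0.0                        # zero out their policy-gradient updates
    return mask

def kl_cov_penalty(logprobs, ref_logprobs, advantages, kl_coef=1.0, ratio=2e-3):
    """KL-Cov (sketch): apply a KL penalty only on high-covariance tokens."""
    cov = ((logprobs - logprobs.mean()) * (advantages - advantages.mean())).detach()
    k = max(1, int(ratio * cov.numel()))
    _, idx = torch.topk(cov, k)
    kl = logprobs - ref_logprobs           # per-token log-ratio KL estimate vs. a reference policy
    penalty = torch.zeros_like(cov)
    penalty[idx] = kl_coef * kl[idx]       # penalize only the selected tokens
    return penalty.mean()
```

In use, the mask would multiply the per-token policy-gradient loss, and the penalty would be added to the objective, keeping the highest-covariance tokens from driving entropy down.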