The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models
May 28, 2025
Authors: Ganqu Cui, Yuchen Zhang, Jiacheng Chen, Lifan Yuan, Zhi Wang, Yuxin Zuo, Haozhan Li, Yuchen Fan, Huayu Chen, Weize Chen, Zhiyuan Liu, Hao Peng, Lei Bai, Wanli Ouyang, Yu Cheng, Bowen Zhou, Ning Ding
cs.AI
Abstract
This paper aims to overcome a major obstacle in scaling RL for reasoning with LLMs, namely the collapse of policy entropy. This phenomenon is consistently observed across a large number of RL runs without entropy intervention: policy entropy drops sharply in the early training stage, and this diminished exploratory ability is always accompanied by saturation of policy performance. In practice, we establish a transformation equation R = -a*e^H + b between entropy H and downstream performance R. This empirical law strongly indicates that policy performance is bought by spending policy entropy and is thus bottlenecked by its exhaustion; the ceiling is fully predictable: at H = 0, R = -a + b. Our finding necessitates entropy management for continued exploration when scaling compute for RL.
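As a rough illustration of how such a law can be used, the minimal sketch below fits R = -a*e^H + b to hypothetical (entropy, performance) pairs and reads off the predicted ceiling at H = 0; the data points, SciPy usage, and function names are illustrative assumptions, not the paper's code or measurements.

```python
# Minimal sketch, not the paper's code: fit the empirical law R = -a * e^H + b to
# hypothetical (entropy, performance) measurements and read off the predicted
# ceiling R = -a + b at H = 0.
import numpy as np
from scipy.optimize import curve_fit

def entropy_performance_law(H, a, b):
    """Transformation equation R = -a * e^H + b from the abstract."""
    return -a * np.exp(H) + b

# Hypothetical logged values of policy entropy H and downstream performance R.
H_obs = np.array([1.2, 0.9, 0.6, 0.4, 0.2, 0.1])
R_obs = np.array([0.19, 0.36, 0.49, 0.55, 0.61, 0.63])

(a_fit, b_fit), _ = curve_fit(entropy_performance_law, H_obs, R_obs)
print(f"fitted a = {a_fit:.3f}, b = {b_fit:.3f}")
print(f"predicted performance ceiling at H = 0: R = -a + b = {b_fit - a_fit:.3f}")
```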
To this end, we investigate entropy dynamics both theoretically and empirically. Our derivation highlights that the change in policy entropy is driven by the covariance between the action probability and the change in logits, which is proportional to the action's advantage when using Policy Gradient-like algorithms. Empirical study shows that the values of the covariance term and the entropy differences match exactly, supporting the theoretical conclusion. Moreover, the covariance term stays mostly positive throughout training, further explaining why policy entropy decreases monotonically.
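For concreteness, here is a minimal sketch, assuming a PyTorch setting and hypothetical per-token log-probabilities and advantages, of the covariance quantity this analysis tracks; it is a monitoring illustration, not the authors' implementation.

```python
# Minimal sketch, assuming PyTorch and hypothetical rollout data: the batch-level
# covariance between token log-probabilities and advantages, which the analysis
# identifies as the quantity driving the step-wise change in policy entropy.
import torch

def logprob_advantage_cov(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Cov(log pi(a|s), A(s,a)) over a flat batch of tokens."""
    return ((logprobs - logprobs.mean()) * (advantages - advantages.mean())).mean()

# Hypothetical per-token log-probs and advantages from one rollout batch.
logprobs = torch.randn(4096) - 1.0
advantages = torch.randn(4096)

cov = logprob_advantage_cov(logprobs, advantages)
# A mostly positive covariance across training steps predicts a monotone entropy decrease.
print(float(cov))
```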
Through understanding the mechanism behind entropy dynamics, we are motivated to control entropy by restricting the updates of high-covariance tokens. Specifically, we propose two simple yet effective techniques, Clip-Cov and KL-Cov, which respectively clip and apply a KL penalty to tokens with high covariance. Experiments show that these methods encourage exploration, helping the policy escape entropy collapse and achieve better downstream performance.
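The sketch below illustrates the general idea of treating high-covariance tokens specially; the token-selection rule, threshold fraction, loss form, and KL proxy are assumptions for illustration, not the paper's exact Clip-Cov / KL-Cov formulations.

```python
# Illustrative sketch only, not the paper's recipe: score each token by its covariance
# contribution (log-prob deviation times advantage deviation), then either drop the
# policy-gradient update on the highest-scoring tokens ("clip") or keep them but add a
# KL penalty toward the old policy ("kl").
import torch

def per_token_cov(logprobs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    return (logprobs - logprobs.mean()) * (advantages - advantages.mean())

def cov_regularized_pg_loss(logprobs, old_logprobs, advantages,
                            top_frac=0.002, kl_coef=1.0, mode="clip"):
    """Per-token policy-gradient surrogate that treats high-covariance tokens specially."""
    cov = per_token_cov(logprobs.detach(), advantages)
    k = max(1, int(top_frac * cov.numel()))
    threshold = torch.topk(cov, k).values.min()
    high = cov >= threshold                     # highest-covariance tokens
    pg = -logprobs * advantages                 # vanilla per-token PG surrogate
    if mode == "clip":
        return pg[~high].mean()                 # exclude high-covariance tokens from the update
    # mode == "kl": keep all tokens, penalize divergence from the old policy on high-cov ones
    kl_proxy = old_logprobs - logprobs          # single-sample log-ratio estimate of KL
    return pg.mean() + kl_coef * kl_proxy[high].mean()

# Hypothetical batch of per-token quantities.
old_lp = torch.randn(4096) - 1.0
lp = (old_lp + 0.05 * torch.randn(4096)).requires_grad_()
adv = torch.randn(4096)
loss = cov_regularized_pg_loss(lp, old_lp, adv, mode="kl")
loss.backward()
```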