Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
June 2, 2025
Authors: Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), while its mechanisms are not yet well understood. In this work, we undertake a pioneering exploration of RLVR through the novel perspective of token entropy patterns, comprehensively analyzing how different tokens influence reasoning performance. By examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and these tokens act as critical forks that steer the model toward diverse reasoning pathways. Furthermore, studying how entropy patterns evolve during RLVR training reveals that RLVR largely adheres to the base model's entropy patterns, primarily adjusting the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens and uncover a finding even beyond the 80/20 rule: utilizing only 20% of the tokens while maintaining performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpassing full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, highlighting a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR primarily arises from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
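The recipe described in the abstract — score each generated token by the entropy of the model's next-token distribution, keep roughly the top 20% highest-entropy "forking" tokens, and apply the policy-gradient update only to those tokens — can be sketched in a few lines of PyTorch. The snippet below is an illustrative sketch, not the authors' implementation: the function names are made up for this example, the advantage estimation, importance ratios, and clipping used in practical RLVR pipelines are omitted, and computing the entropy threshold per response (rather than, say, per batch) is an assumption made here for simplicity.

```python
# Minimal sketch (not the paper's code): per-token entropy plus an
# entropy-masked policy-gradient loss restricted to forking tokens.
import torch
import torch.nn.functional as F


def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Entropy H_t = -sum_v p_t(v) * log p_t(v) of the next-token distribution.

    logits: [batch, seq_len, vocab_size] -> returns [batch, seq_len].
    """
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def forking_token_mask(entropy: torch.Tensor, top_fraction: float = 0.2) -> torch.Tensor:
    """1 for the top `top_fraction` highest-entropy (forking) tokens, else 0.

    Threshold is taken per response here; this is a simplifying assumption.
    """
    k = max(1, int(top_fraction * entropy.shape[-1]))
    kth_value = entropy.topk(k, dim=-1).values[..., -1:]  # k-th largest per response
    return (entropy >= kth_value).float()


def masked_pg_loss(logprobs_taken: torch.Tensor,
                   advantages: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style surrogate computed only on forking tokens.

    logprobs_taken: [batch, seq_len] log-probs of the sampled tokens.
    advantages:     [batch, seq_len] advantages from the verifiable reward.
    mask:           [batch, seq_len] forking-token mask from above.
    """
    per_token = -logprobs_taken * advantages * mask
    return per_token.sum() / mask.sum().clamp(min=1.0)
```

In a training loop, one would compute `forking_token_mask(token_entropy(logits))` on the rollout logits and feed the resulting mask into `masked_pg_loss`, so that low-entropy tokens contribute no gradient.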