Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
June 2, 2025
Authors: Shenzhi Wang, Le Yu, Chang Gao, Chujie Zheng, Shixuan Liu, Rui Lu, Kai Dang, Xionghui Chen, Jianxin Yang, Zhenru Zhang, Yuqiong Liu, An Yang, Andrew Zhao, Yang Yue, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a
powerful approach to enhancing the reasoning capabilities of Large Language
Models (LLMs), yet its mechanisms are not well understood. In this work,
we undertake a pioneering exploration of RLVR through the novel perspective of
token entropy patterns, comprehensively analyzing how different tokens
influence reasoning performance. By examining token entropy patterns in
Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of
tokens exhibit high entropy, and these tokens act as critical forks that steer
the model toward diverse reasoning pathways. Furthermore, studying how entropy
patterns evolve during RLVR training reveals that RLVR largely adheres to the
base model's entropy patterns, primarily adjusting the entropy of high-entropy
tokens. These findings highlight the significance of high-entropy tokens (i.e.,
forking tokens) to RLVR. We ultimately improve RLVR by restricting policy
gradient updates to forking tokens and uncover a finding even beyond the 80/20
rule: updating only 20% of the tokens maintains performance comparable
to full-gradient updates on the Qwen3-8B base model and significantly
surpasses full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71
on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models,
showing a strong scaling trend. In contrast, training exclusively on the
80% lowest-entropy tokens leads to a marked decline in performance. These
findings indicate that the efficacy of RLVR primarily arises from optimizing
the high-entropy tokens that decide reasoning directions. Collectively, our
results highlight the potential to understand RLVR through a token-entropy
perspective and optimize RLVR by leveraging high-entropy minority tokens to
further improve LLM reasoning.
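
To make the core idea concrete, below is a minimal sketch (in PyTorch) of restricting a token-level policy-gradient loss to the highest-entropy "forking" tokens, where per-token entropy is H_t = -sum_j p_{t,j} log p_{t,j}. The function name, tensor shapes, and batch-level top-k thresholding are illustrative assumptions for exposition, not the authors' exact implementation or training pipeline.

```python
import torch
import torch.nn.functional as F

def entropy_masked_pg_loss(logits, actions, advantages, keep_ratio=0.2):
    """Policy-gradient loss restricted to the highest-entropy ("forking") tokens.

    logits:      [batch, seq_len, vocab]  policy logits for each generated token
    actions:     [batch, seq_len]         sampled token ids
    advantages:  [batch, seq_len]         per-token advantage estimates
    keep_ratio:  fraction of tokens (by entropy) that receive gradient updates
    """
    log_probs = F.log_softmax(logits, dim=-1)        # [B, T, V]
    probs = log_probs.exp()

    # Token-level entropy: H_t = -sum_j p_{t,j} * log p_{t,j}
    entropy = -(probs * log_probs).sum(dim=-1)        # [B, T]

    # Keep only the top `keep_ratio` highest-entropy tokens in this batch
    # (batch-level thresholding is an assumption made for simplicity).
    k = max(1, int(keep_ratio * entropy.numel()))
    threshold = torch.topk(entropy.flatten(), k).values.min()
    mask = (entropy >= threshold).float()             # 1 for forking tokens

    # Log-probability of the sampled tokens.
    action_log_probs = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)

    # REINFORCE-style objective; masked (low-entropy) tokens contribute no gradient.
    return -(advantages * action_log_probs * mask).sum() / mask.sum().clamp(min=1)
```

In a full RLVR setup the entropy would typically be computed from the rollout policy's outputs and this masked term would be combined with the clipping or KL components of the underlying algorithm; the sketch only isolates the entropy-based token selection described in the abstract.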