Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs), but its mechanisms are not yet well understood. In this work, we explore RLVR through the novel perspective of token entropy patterns and comprehensively analyze how different tokens influence reasoning performance. Examining token entropy patterns in Chain-of-Thought (CoT) reasoning, we observe that only a small fraction of tokens exhibit high entropy, and that these tokens act as critical forks that steer the model toward diverse reasoning pathways. Studying how entropy patterns evolve during RLVR training further reveals that RLVR largely adheres to the base model's entropy patterns and primarily adjusts the entropy of high-entropy tokens. These findings highlight the significance of high-entropy tokens (i.e., forking tokens) to RLVR. We ultimately improve RLVR by restricting policy gradient updates to forking tokens, and we uncover a finding that goes beyond the 80/20 rule. Using only 20% of the tokens, this approach maintains performance comparable to full-gradient updates on the Qwen3-8B base model and significantly surpasses full-gradient updates on the Qwen3-32B (+11.04 on AIME'25 and +7.71 on AIME'24) and Qwen3-14B (+4.79 on AIME'25 and +5.21 on AIME'24) base models, which indicates a strong scaling trend. In contrast, training exclusively on the 80% lowest-entropy tokens leads to a marked decline in performance. These findings indicate that the efficacy of RLVR arises primarily from optimizing the high-entropy tokens that decide reasoning directions. Collectively, our results highlight the potential to understand RLVR through a token-entropy perspective and to optimize RLVR by leveraging high-entropy minority tokens to further improve LLM reasoning.
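The token-selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the training objective or how the 20% cutoff is computed, so everything below is an assumption made for illustration. It uses a plain REINFORCE-style surrogate, selects tokens per sequence rather than per batch, and the function names (`token_entropy`, `high_entropy_mask`, `masked_pg_loss`) are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_entropy_mask(entropies, keep_fraction=0.2):
    """Return a 0/1 mask marking the `keep_fraction` highest-entropy positions.

    Assumption: the cutoff is taken within a single sequence; at least one
    position is always kept.
    """
    n = len(entropies)
    k = max(1, round(keep_fraction * n))
    # Positions ordered by entropy, highest first; ties broken by position.
    top = sorted(range(n), key=lambda i: (-entropies[i], i))[:k]
    mask = [0] * n
    for i in top:
        mask[i] = 1
    return mask


def masked_pg_loss(logprobs, advantages, mask):
    """Policy-gradient surrogate averaged over the selected tokens only.

    Assumption: a simple -logprob * advantage term per token; tokens with
    mask 0 contribute nothing to the loss.
    """
    selected = sum(mask)
    total = sum(-lp * adv * m for lp, adv, m in zip(logprobs, advantages, mask))
    return total / selected


# Toy sequence of five positions over a four-token vocabulary.
dists = [
    [0.97, 0.01, 0.01, 0.01],  # near-deterministic continuation
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain: a "forking" position
    [0.90, 0.05, 0.03, 0.02],
    [1.00, 0.00, 0.00, 0.00],
    [0.60, 0.30, 0.05, 0.05],
]
entropies = [token_entropy(d) for d in dists]
mask = high_entropy_mask(entropies, keep_fraction=0.2)
loss = masked_pg_loss([-0.1, -1.2, -0.3, -0.05, -0.7], [1.0] * 5, mask)
```

In this toy sequence only the uniform distribution at position 1 is selected, so the loss depends on that token alone. In an actual training setup the same mask would multiply the per-token loss before backpropagation, so that low-entropy tokens receive no gradient.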
