Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward
October 3, 2025
Authors: Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou
cs.AI
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term "reasoning sparks". We find that while these sparks are abundant in pre-trained models, they are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy in which the probability of reasoning sparks is amplified; this proxy then serves as a soft regularization target that shields these valuable tokens from elimination via a KL-divergence penalty. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
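The sketch below illustrates the mechanism described in the abstract, not the authors' released implementation: it assumes a simple absolute probability floor `p_min` for marking presumed noise tokens, re-normalizes the surviving candidates into a proxy distribution, and applies a KL penalty pulling the policy toward that proxy. The threshold rule, KL direction, and coefficient `reg_coef` are illustrative assumptions; the paper's exact filtering and weighting may differ.

```python
# Minimal Lp-Reg-style sketch (illustrative assumptions, not the official code).
import torch
import torch.nn.functional as F

def lp_reg_penalty(logits: torch.Tensor, p_min: float = 1e-3,
                   reg_coef: float = 0.1) -> torch.Tensor:
    """KL penalty toward a proxy built from the current policy.

    logits: [batch, vocab] next-token logits of the policy being trained.
    p_min:  assumed probability floor; tokens below it are treated as noise.
    """
    probs = logits.softmax(dim=-1)  # current policy pi(token | context)

    # Filter presumed noise tokens, but always keep the argmax so the
    # proxy never becomes empty.
    keep = probs >= p_min
    keep = keep | F.one_hot(probs.argmax(dim=-1), probs.size(-1)).bool()

    # Re-normalize over surviving candidates: low-probability tokens that
    # pass the filter ("reasoning sparks") gain relative mass in the proxy.
    proxy = (probs * keep).detach()          # soft target, no gradient
    proxy = proxy / proxy.sum(dim=-1, keepdim=True)

    # Soft regularization: KL(proxy || policy); filtered tokens have zero
    # proxy mass and contribute nothing to the sum.
    log_policy = F.log_softmax(logits, dim=-1)
    kl = (proxy * (proxy.clamp_min(1e-12).log() - log_policy)).sum(dim=-1)
    return reg_coef * kl.mean()
```

In an RLVR training loop, such a term would be added to the policy-gradient loss at the regularized token positions, so that updates which would otherwise drive surviving low-probability tokens toward zero are counteracted by the proxy target.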