Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

October 3, 2025
Authors: Guanhua Huang, Tingqiang Xu, Mingze Wang, Qi Yi, Xue Gong, Siheng Li, Ruibin Xiong, Kejiao Li, Yuhao Jiang, Bo Zhou
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck where performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, yet the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term reasoning sparks. We find that while abundant in pre-trained models, these sparks are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less-noisy proxy where the probability of reasoning sparks is amplified, which then serves as a soft regularization target to shield these valuable tokens from elimination via KL divergence. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
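
For a concrete picture of the mechanism described in the abstract, below is a minimal PyTorch sketch of the Lp-Reg idea. It assumes a simple probability-threshold rule for identifying the presumed noise tokens and a KL(proxy || policy) penalty added to the policy loss; the function name lp_reg_kl, the threshold value, and the 0.1 weight are illustrative assumptions rather than the paper's exact formulation (see the linked repository for the authors' implementation).

```python
# Sketch of low-probability regularization (Lp-Reg), under the assumptions
# stated above: noise tokens are those below a probability threshold, and the
# policy is softly pulled toward the filtered, re-normalized proxy via KL.
import torch
import torch.nn.functional as F


def lp_reg_kl(logits: torch.Tensor, noise_threshold: float = 1e-3) -> torch.Tensor:
    """KL term pulling the policy toward a filtered, re-normalized proxy."""
    probs = F.softmax(logits, dim=-1)  # policy distribution over the vocabulary

    # Build the heuristic proxy: drop tokens below the noise threshold, then
    # re-normalize over the surviving candidates. This amplifies the relative
    # probability of low-probability "reasoning sparks" that clear the
    # threshold while discarding presumed noise.
    keep = probs >= noise_threshold
    proxy = probs * keep
    proxy = (proxy / proxy.sum(dim=-1, keepdim=True).clamp_min(1e-12)).detach()

    # Soft regularization target: KL(proxy || policy). The proxy is detached,
    # so gradients only flow through the policy's log-probabilities.
    log_probs = F.log_softmax(logits, dim=-1)
    kl = (proxy * (proxy.clamp_min(1e-12).log() - log_probs)).sum(dim=-1)
    return kl.mean()


# Usage: add the term to the usual RLVR policy objective with a small weight.
logits = torch.randn(4, 32000)        # dummy per-token logits
policy_loss = torch.zeros(())         # placeholder for the RLVR policy loss
total_loss = policy_loss + 0.1 * lp_reg_kl(logits)  # 0.1 is a hypothetical coefficient
```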