Low-probability Tokens Sustain Exploration in Reinforcement Learning with Verifiable Reward

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has propelled Large Language Models in complex reasoning, yet its scalability is often hindered by a training bottleneck: performance plateaus as policy entropy collapses, signaling a loss of exploration. Previous methods typically address this by maintaining high policy entropy, but the precise mechanisms that govern meaningful exploration have remained underexplored. Our analysis suggests that an unselective focus on entropy risks amplifying irrelevant tokens and destabilizing training. This paper investigates the exploration dynamics within RLVR and identifies a key issue: the gradual elimination of valuable low-probability exploratory tokens, which we term "reasoning sparks". We find that these sparks are abundant in pre-trained models but are systematically extinguished during RLVR due to over-penalization, leading to a degeneracy in exploration. To address this, we introduce Low-probability Regularization (Lp-Reg). Its core mechanism regularizes the policy towards a heuristic proxy distribution. This proxy is constructed by filtering out presumed noise tokens and re-normalizing the distribution over the remaining candidates. The result is a less noisy proxy in which the probability of reasoning sparks is amplified; it then serves as a soft regularization target that, via KL divergence, shields these valuable tokens from elimination. Experiments show that Lp-Reg enables stable on-policy training for around 1,000 steps, a regime where baseline entropy-control methods collapse. This sustained exploration leads to state-of-the-art performance, achieving a 60.17% average accuracy on five math benchmarks, an improvement of 2.66% over prior methods. Code is available at https://github.com/CarlanLark/Lp-Reg.
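The proxy construction described in the abstract (filter presumed noise tokens, re-normalize, then regularize via KL divergence) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_proxy` and `kl_divergence`, the fixed probability cutoff used to decide which tokens count as noise, the toy four-token distribution, and the direction of the KL term are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def build_proxy(probs, threshold):
    """Drop tokens below `threshold` (assumed noise rule), then re-normalize."""
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("threshold removed every token")
    return [p / total for p in kept]


def kl_divergence(p, q):
    """KL(p || q) over a discrete distribution, skipping zero-mass terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy next-token distribution (hypothetical values).
policy = [0.90, 0.06, 0.03, 0.01]

# The 0.01 token is treated as noise; the surviving low-probability
# tokens (0.06 and 0.03) gain mass after re-normalization.
proxy = build_proxy(policy, threshold=0.02)

# Assumed form of the regularizer: KL from the proxy to the policy.
penalty = kl_divergence(proxy, policy)
```

In this toy case the filtered token receives zero mass in the proxy, while each surviving token is scaled up by the same factor, which is the sense in which the low-probability candidates are "amplified" relative to the original policy.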
