From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space

Abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
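The pruning effect attributed to Negative Sample Reinforcement can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a four-way softmax "policy" stands in for the model, and the names `softmax`, `nsr_step`, `wrong_idx`, and `lr` are hypothetical. It takes one gradient-descent step on the log-probability of a single output marked as wrong and shows that the removed mass is spread over the remaining outputs rather than collapsing onto one of them.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def nsr_step(logits, wrong_idx, lr=0.5):
    """One gradient-descent step on log p(wrong_idx) for a softmax policy.

    Uses d log p_k / d z_j = 1[j == k] - p_j, so the penalized logit
    drops and every other logit rises in proportion to its probability.
    """
    p = softmax(logits)
    grad = -p.copy()
    grad[wrong_idx] += 1.0
    return logits - lr * grad

# Toy policy over four candidate outputs (assumed values, for illustration only).
logits = np.array([2.0, 1.0, 0.5, 0.0])
before = softmax(logits)
after = softmax(nsr_step(logits, wrong_idx=0))

print("before:", np.round(before, 3))
print("after: ", np.round(after, 3))
```

In this toy setting the penalized output loses probability while the other three all gain some, and their relative order is unchanged. That is one way to read the abstract's claim that NSR prunes incorrect regions while keeping exploration broad; whether the same picture holds for updates on P(y) in a full language model is what the paper's experiments address.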
