From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space

Abstract

Reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), but its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Conventional pre-training, however, learns passively from static corpora, which leads to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that first initializes models with NSR-PreRL to expand the reasoning horizon and then transitions to standard RL for fine-grained optimization. Extensive experiments show that DSRL consistently outperforms strong baselines, indicating that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
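The abstract does not state the exact objectives, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. It assumes NSR can be written as a loss that averages the sequence log-probabilities of incorrect samples (reward 0), so that minimizing it lowers their likelihood, and it assumes gradient alignment can be measured as a cosine similarity between two gradient vectors. The function names `nsr_loss` and `cosine_alignment` are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def nsr_loss(seq_logprobs: Sequence[float], rewards: Sequence[int]) -> float:
    """Average log-probability of the incorrect samples (reward == 0).

    Minimizing this value pushes down the likelihood of incorrect samples
    only; correct samples (reward == 1) contribute nothing.
    Assumption: in a PreRL-style setting the log-probabilities would be
    log P(y), scored without the prompt; in standard RL they would be
    log P(y|x).
    """
    negatives = [lp for lp, r in zip(seq_logprobs, rewards) if r == 0]
    if not negatives:
        return 0.0  # no incorrect samples in this batch, nothing to prune
    return sum(negatives) / len(negatives)


def cosine_alignment(grad_a: Sequence[float], grad_b: Sequence[float]) -> float:
    """Cosine similarity between two flattened gradient vectors.

    Assumption: this is one common way to quantify alignment between the
    gradients of log P(y) and log P(y|x); the paper may use another metric.
    """
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm = math.sqrt(sum(a * a for a in grad_a)) * math.sqrt(sum(b * b for b in grad_b))
    if norm == 0.0:
        return 0.0  # alignment is undefined for a zero gradient; report 0
    return dot / norm


if __name__ == "__main__":
    # Three sampled responses: the first is verified correct, the other two are not.
    print(nsr_loss([-2.0, -4.0, -6.0], [1, 0, 0]))
    print(cosine_alignment([1.0, 0.0], [1.0, 0.0]))
```

In a two-stage schedule such as the one DSRL describes, a loss of this shape would be applied with prompt-free log-probabilities during the first stage, before a standard RL objective over log P(y|x) takes over in the second.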
