From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space
April 15, 2026
作者: Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
cs.AI
Abstract
While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
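To make the Negative Sample Reinforcement (NSR) mechanism concrete, here is a minimal toy sketch, not the paper's implementation: a REINFORCE-style update on a small discrete marginal distribution P(y) (no prompt x), where sampled incorrect outputs receive reward -1 and correct outputs receive reward 0. Only negative samples contribute gradient, so the update purely prunes incorrect mass, and the softmax implicitly shifts probability toward the correct outputs. All names and the 4-output setup are illustrative assumptions.

```python
import math
import random

# Toy sketch (illustrative, not the paper's method): NSR-style pruning
# of a marginal distribution P(y) over a tiny discrete output space.
# Incorrect samples get reward -1 (their log-probability is pushed
# down); correct samples get reward 0 and contribute no gradient.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def nsr_step(logits, correct, lr=0.5, n_samples=32, rng=None):
    """One REINFORCE update using only negative rewards.

    Sample y ~ P(y); for incorrect y, apply reward -1, i.e. a step
    that decreases log P(y). Correct samples are left untouched,
    so the update is pure pruning of the incorrect subspace.
    """
    rng = rng or random
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        y = rng.choices(range(len(logits)), weights=probs)[0]
        if y not in correct:
            # REINFORCE: d(log P(y))/d(logits) = onehot(y) - probs.
            # Reward -1 flips the sign, lowering the sampled wrong y.
            for i, p in enumerate(probs):
                grad[i] += -((1.0 if i == y else 0.0) - p)
    return [l + lr * g / n_samples for l, g in zip(logits, grad)]

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]  # outputs 0..3, initially uniform P(y)
correct = {0}                  # only output 0 counts as "correct"
for _ in range(50):
    logits = nsr_step(logits, correct, rng=rng)
probs = softmax(logits)
print(probs)  # probability mass concentrates on the correct output
```

Note that the correct output's probability rises without ever being positively rewarded; this mirrors the abstract's claim that pruning incorrect reasoning spaces alone can steer the policy toward the correct subspace.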