
From P(y|x) to P(y): Investigating Reinforcement Learning in Pre-train Space

April 15, 2026
Authors: Yuqiao Tan, Minzheng Wang, Bo Liu, Zichen Liu, Tian Liang, Shizhu He, Jun Zhao, Kang Liu
cs.AI

Abstract

While reinforcement learning with verifiable rewards (RLVR) significantly enhances LLM reasoning by optimizing the conditional distribution P(y|x), its potential is fundamentally bounded by the base model's existing output distribution. Optimizing the marginal distribution P(y) in the Pre-train Space addresses this bottleneck by encoding reasoning ability and preserving broad exploration capacity. Yet, conventional pre-training relies on static corpora for passive learning, leading to a distribution shift that hinders targeted reasoning enhancement. In this paper, we introduce PreRL (Pre-train Space RL), which applies reward-driven online updates directly to P(y). We theoretically and empirically validate the strong gradient alignment between log P(y) and log P(y|x), establishing PreRL as a viable surrogate for standard RL. Furthermore, we uncover a critical mechanism: Negative Sample Reinforcement (NSR) within PreRL serves as an exceptionally effective driver for reasoning. NSR-PreRL rapidly prunes incorrect reasoning spaces while stimulating endogenous reflective behaviors, increasing transition and reflection thoughts by 14.89x and 6.54x, respectively. Leveraging these insights, we propose Dual Space RL (DSRL), a Policy Reincarnation strategy that initializes models with NSR-PreRL to expand the reasoning horizon before transitioning to standard RL for fine-grained optimization. Extensive experiments demonstrate that DSRL consistently outperforms strong baselines, proving that pre-train space pruning effectively steers the policy toward a refined correct reasoning subspace.
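The core mechanism the abstract describes — Negative Sample Reinforcement pruning incorrect regions of the marginal distribution P(y) — can be illustrated with a toy sketch. This is not the paper's implementation; it assumes a plain REINFORCE-style update over a small discrete answer space, restricted to negative (incorrect) samples, with a hypothetical verifiable-reward set `correct`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": unconditional logits over a small discrete answer space.
# PreRL optimizes the marginal P(y) directly, so there is no prompt x
# here -- the policy samples answers unconditionally.
logits = np.zeros(5)      # 5 candidate answers
correct = {0, 1}          # verifiable reward: 1 if y is correct, else 0
lr = 0.5

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(200):
    p = softmax(logits)
    y = rng.choice(len(p), p=p)
    # NSR-style update (assumption: REINFORCE restricted to negative
    # samples): only incorrect answers produce a gradient, which prunes
    # probability mass from the wrong region of P(y).
    if y not in correct:
        grad = -p                  # d log p(y) / d logits = onehot(y) - p
        grad[y] += 1.0
        logits -= lr * grad        # descend: push the wrong answer down

# Probability mass concentrates on the correct subspace without any
# positive-sample updates -- pruning alone steers the distribution.
final_mass_on_correct = softmax(logits)[list(correct)].sum()
```

The point of the sketch is that suppressing wrong answers implicitly redistributes mass toward the correct subspace, mirroring the abstract's claim that pre-train-space pruning steers the policy toward a refined correct reasoning subspace.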