RLP: Reinforcement as a Pretraining Objective
September 26, 2025
Authors: Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
cs.AI
Abstract
The dominant paradigm for training large reasoning models starts with
pre-training using next-token prediction loss on vast amounts of data.
Reinforcement learning, while powerful in scaling reasoning, is introduced only
as the very last phase of post-training, preceded by supervised fine-tuning.
While dominant, is this an optimal way of training? In this paper, we present
RLP, an information-driven reinforcement pretraining objective that brings the
core spirit of reinforcement learning -- exploration -- to the last phase of
pretraining. The key idea is to treat chain-of-thought as an exploratory
action, with rewards computed based on the information gain it provides for
predicting future tokens. This training objective essentially encourages the
model to think for itself before predicting what comes next, thus teaching an
independent thinking behavior earlier in the pretraining. More concretely, the
reward signal measures the increase in log-likelihood of the next token when
conditioning on both context and a sampled reasoning chain, compared to
conditioning on context alone. This approach yields a verifier-free dense
reward signal, allowing for efficient training on the full document stream
during pretraining. Specifically, RLP reframes reinforcement learning for
reasoning as a pretraining objective on ordinary text, bridging the gap between
next-token prediction and the emergence of useful chain-of-thought reasoning.
Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an
eight-benchmark math-and-science suite by 19%. With identical post-training,
the gains compound, with the largest improvements on reasoning-heavy tasks such
as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2
increases the overall average from 42.81% to 61.32% and raises the average on
scientific reasoning by 23%, demonstrating scalability across architectures and
model sizes.
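To make the reward concrete: for a next token x following a context, the abstract defines the signal as the gain in log-likelihood from conditioning on a sampled reasoning chain c, i.e. log p(x | context, c) - log p(x | context). The sketch below is one minimal, illustrative way to compute this quantity with an off-the-shelf causal LM; the model name, the example context, the hand-written "thought", and the next_token_logprob helper are placeholder assumptions, not the authors' implementation, which applies this reward across full document streams during pretraining.

```python
# Minimal sketch (assumptions, not the paper's code): compute an RLP-style
# information-gain reward  log p(x | context, cot) - log p(x | context)
# for a single continuation, using any Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def next_token_logprob(prefix: str, continuation: str) -> float:
    """Log-probability the model assigns to `continuation` right after `prefix`."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    cont_ids = tok(continuation, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Log-prob of each token given everything before it; keep only the continuation part.
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    per_token = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return per_token[0, prefix_ids.shape[1] - 1:].sum().item()

context = "The derivative of x^2 with respect to x is"
cot = " Think: by the power rule, d/dx x^n = n * x^(n - 1), so with n = 2 ..."
target = " 2x"

# Reward = information gain of the sampled "thought" for predicting the next token.
reward = next_token_logprob(context + cot, target) - next_token_logprob(context, target)
print(f"RLP-style information-gain reward: {reward:.4f}")
```

In the paper this quantity plays the role of the dense, verifier-free reward for the sampled chain-of-thought, so no external verifier or curated answer key is needed to score the model's thinking.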