

RLP: Reinforcement as a Pretraining Objective

September 26, 2025
Authors: Ali Hatamizadeh, Syeda Nahida Akter, Shrimai Prabhumoye, Jan Kautz, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Yejin Choi
cs.AI

Abstract

The dominant paradigm for training large reasoning models starts with pretraining using next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful in scaling reasoning, is introduced only as the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning -- exploration -- to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed based on the information gain it provides for predicting future tokens. This training objective essentially encourages the model to think for itself before predicting what comes next, teaching independent-thinking behavior earlier in pretraining. More concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared to conditioning on the context alone. This approach yields a verifier-free, dense reward signal, allowing efficient training on the full document stream during pretraining. In effect, RLP reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid Nemotron-Nano-12B-v2 increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
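
To make the reward described above concrete, here is a minimal Python sketch of the information-gain signal: the reward for a sampled reasoning chain is the improvement in the next token's log-likelihood when conditioning on the context plus the chain versus the context alone. The `log_prob` interface, function names, and toy numbers below are illustrative assumptions for exposition, not the authors' implementation.

```python
import math
from typing import Callable, List

# Hypothetical interface: log_prob(next_token, context) returns a language
# model's log-probability of `next_token` given the `context` token list.
# In practice this would wrap a model forward pass; here it is a stand-in.
LogProbFn = Callable[[str, List[str]], float]

def rlp_reward(
    log_prob: LogProbFn,
    context: List[str],
    reasoning_chain: List[str],
    next_token: str,
) -> float:
    """Information-gain reward for one sampled chain-of-thought.

    Reward = log p(next_token | context + chain) - log p(next_token | context),
    i.e. how much the sampled reasoning improves prediction of the next token.
    No external verifier is needed, so the signal is dense over ordinary text.
    """
    with_chain = log_prob(next_token, context + reasoning_chain)
    without_chain = log_prob(next_token, context)
    return with_chain - without_chain

# Toy usage with a fake model that assigns higher probability to a token
# that already appears in its conditioning context (illustrative numbers).
def toy_log_prob(next_token: str, context: List[str]) -> float:
    return math.log(0.6) if next_token in context else math.log(0.1)

ctx = ["2", "+", "2", "="]
chain = ["two", "plus", "two", "is", "4"]
print(rlp_reward(toy_log_prob, ctx, chain, "4"))  # positive: the chain was informative
```

A positive reward indicates the sampled chain made the next token more predictable; a chain that adds nothing (or misleads) earns zero or negative reward, which is what lets the objective run over full document streams without task-specific verifiers.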