
Reinforcement Learning on Pre-Training Data

September 23, 2025
Authors: Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang
cs.AI

Abstract

The growing disparity between the exponential scaling of computational resources and the finite growth of high-quality text data now constrains conventional scaling approaches for large language models (LLMs). To address this challenge, we introduce Reinforcement Learning on Pre-Training data (RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast to prior approaches that scale training primarily through supervised learning, RLPT enables the policy to autonomously explore meaningful trajectories to learn from pre-training data and improve its capability through reinforcement learning (RL). While existing RL strategies such as reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR) rely on human annotation for reward construction, RLPT eliminates this dependency by deriving reward signals directly from pre-training data. Specifically, it adopts a next-segment reasoning objective, rewarding the policy for accurately predicting subsequent text segments conditioned on the preceding context. This formulation allows RL to be scaled on pre-training data, encouraging the exploration of richer trajectories across broader contexts and thereby fostering more generalizable reasoning skills. Extensive experiments on both general-domain and mathematical reasoning benchmarks across multiple models validate the effectiveness of RLPT. For example, when applied to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1, 6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and AIME25, respectively. The results further demonstrate favorable scaling behavior, suggesting strong potential for continued gains with more compute. In addition, RLPT provides a solid foundation, extending the reasoning boundaries of LLMs and enhancing RLVR performance.
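To make the next-segment reasoning objective concrete, below is a minimal sketch of how a reward could be derived directly from pre-training data, as the abstract describes: the policy is conditioned on the preceding context, predicts the next text segment, and is rewarded for how well its prediction matches the ground-truth continuation. The `policy_generate` callable and the character-level segmentation are illustrative assumptions, and the `SequenceMatcher` similarity is a simple stand-in for whatever segment-matching reward the paper actually uses.

```python
# Illustrative sketch of RLPT's next-segment reasoning reward (not the authors' code).
# Assumptions: `policy_generate` is a hypothetical callable wrapping the policy LLM,
# and token-overlap similarity is a proxy for the paper's actual reward signal.
from difflib import SequenceMatcher


def split_into_segments(document: str, segment_len: int = 256) -> list[str]:
    """Chunk a pre-training document into fixed-size character segments."""
    return [document[i:i + segment_len] for i in range(0, len(document), segment_len)]


def next_segment_reward(predicted: str, reference: str) -> float:
    """Proxy reward in [0, 1]: similarity between predicted and true next segment."""
    return SequenceMatcher(None, predicted.strip(), reference.strip()).ratio()


def rollout_rewards(document: str, policy_generate) -> list[float]:
    """For each prefix of the document, let the policy predict the next segment,
    then score the prediction against the ground-truth continuation."""
    segments = split_into_segments(document)
    rewards = []
    for t in range(1, len(segments)):
        context = "".join(segments[:t])        # preceding context from pre-training data
        prediction = policy_generate(context)  # policy's reasoned prediction of the next segment
        rewards.append(next_segment_reward(prediction, segments[t]))
    return rewards
```

Because the reference continuation comes from the pre-training corpus itself, no human annotation is needed to construct the reward, which is the property the abstract contrasts with RLHF and RLVR.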