Reinforcement Learning on Pre-Training Data
September 23, 2025
作者: Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang
cs.AI
Abstract
The growing disparity between the exponential scaling of computational
resources and the finite growth of high-quality text data now constrains
conventional scaling approaches for large language models (LLMs). To address
this challenge, we introduce Reinforcement Learning on Pre-Training data
(RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast
to prior approaches that scale training primarily through supervised learning,
RLPT enables the policy to autonomously explore meaningful trajectories to
learn from pre-training data and improve its capability through reinforcement
learning (RL). While existing RL strategies such as reinforcement learning from
human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR)
rely on human annotation for reward construction, RLPT eliminates this
dependency by deriving reward signals directly from pre-training data.
Specifically, it adopts a next-segment reasoning objective, rewarding the
policy for accurately predicting subsequent text segments conditioned on the
preceding context. This formulation allows RL to be scaled on pre-training
data, encouraging the exploration of richer trajectories across broader
contexts and thereby fostering more generalizable reasoning skills. Extensive
experiments on both general-domain and mathematical reasoning benchmarks across
multiple models validate the effectiveness of RLPT. For example, when applied
to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1,
6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and
AIME25, respectively. The results further demonstrate favorable scaling
behavior, suggesting strong potential for continued gains with more compute. In
addition, RLPT provides a solid foundation for extending the reasoning boundaries
of LLMs and enhances subsequent RLVR performance.
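
The abstract only sketches the next-segment reasoning objective, so the snippet below gives a minimal, hypothetical illustration of how such a reward could be derived from raw pre-training text: split a document into a preceding context and a gold next segment, sample a continuation from the policy, and score their agreement. The function names, prompt template, and character-overlap scorer are assumptions for illustration, not the paper's actual verifier or data pipeline.

```python
# Minimal sketch of a next-segment reasoning reward, assuming a simple
# document split and a character-overlap scorer in place of the paper's
# (unspecified here) verifier. All names below are illustrative.

from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def make_next_segment_example(document: str, boundary: int) -> Tuple[str, str]:
    """Split a pre-training document into (preceding context, gold next segment)."""
    return document[:boundary], document[boundary:]


def next_segment_reward(prediction: str, gold_segment: str) -> float:
    """Score how closely the policy's predicted segment matches the gold segment.

    A character-level similarity ratio stands in for whatever scoring
    mechanism RLPT actually uses to reward accurate next-segment prediction.
    """
    return SequenceMatcher(None, prediction.strip(), gold_segment.strip()).ratio()


def rollout_rewards(
    policy_generate: Callable[[str], str],  # maps a prompt to a sampled continuation
    documents: List[str],
    boundary: int = 512,
) -> List[float]:
    """Collect rewards for one batch of rollouts; these would feed an RL update."""
    rewards = []
    for doc in documents:
        context, gold = make_next_segment_example(doc, boundary)
        prompt = f"{context}\n\nContinue the text with the next segment:"
        prediction = policy_generate(prompt)
        rewards.append(next_segment_reward(prediction, gold))
    return rewards
```

Because the reward is computed purely from the pre-training document itself, no human annotation is needed; the rewards returned by a loop like this could then drive any standard policy-gradient update.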