Reinforcement Learning on Pre-Training Data
September 23, 2025
作者: Siheng Li, Kejiao Li, Zenan Xu, Guanhua Huang, Evander Yang, Kun Li, Haoyuan Wu, Jiajia Wu, Zihao Zheng, Chenchen Zhang, Kun Shi, Kyrierl Deng, Qi Yi, Ruibin Xiong, Tingqiang Xu, Yuhao Jiang, Jianfeng Yan, Yuyuan Zeng, Guanghui Xu, Jinbao Xue, Zhijiang Xu, Zheng Fang, Shuai Li, Qibin Liu, Xiaoxue Li, Zhuoyu Li, Yangyu Tao, Fei Gao, Cheng Jiang, Bo Chao Wang, Kai Liu, Jianchen Zhu, Wai Lam, Wayyt Wang, Bo Zhou, Di Wang
cs.AI
Abstract
The growing disparity between the exponential scaling of computational
resources and the finite growth of high-quality text data now constrains
conventional scaling approaches for large language models (LLMs). To address
this challenge, we introduce Reinforcement Learning on Pre-Training data
(RLPT), a new training-time scaling paradigm for optimizing LLMs. In contrast
to prior approaches that scale training primarily through supervised learning,
RLPT enables the policy to autonomously explore meaningful trajectories to
learn from pre-training data and improve its capability through reinforcement
learning (RL). While existing RL strategies such as reinforcement learning from
human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR)
rely on human annotation for reward construction, RLPT eliminates this
dependency by deriving reward signals directly from pre-training data.
Specifically, it adopts a next-segment reasoning objective, rewarding the
policy for accurately predicting subsequent text segments conditioned on the
preceding context. This formulation allows RL to be scaled on pre-training
data, encouraging the exploration of richer trajectories across broader
contexts and thereby fostering more generalizable reasoning skills. Extensive
experiments on both general-domain and mathematical reasoning benchmarks across
multiple models validate the effectiveness of RLPT. For example, when applied
to Qwen3-4B-Base, RLPT yields absolute improvements of 3.0, 5.1, 8.1,
6.0, 6.6, and 5.3 on MMLU, MMLU-Pro, GPQA-Diamond, KOR-Bench, AIME24, and
AIME25, respectively. The results further demonstrate favorable scaling
behavior, suggesting strong potential for continued gains with more compute. In
addition, RLPT provides a solid foundation for extending the reasoning boundaries
of LLMs and enhances subsequent RLVR performance.
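
The abstract only sketches the next-segment reasoning objective, so the snippet below gives a minimal, hypothetical illustration of how such a reward could be derived from raw pre-training text: split a document into a preceding context and a gold next segment, sample a continuation from the policy, and score their agreement. The function names, prompt template, and character-overlap scorer are assumptions for illustration, not the paper's actual verifier or data pipeline.

```python
# Minimal sketch of a next-segment reasoning reward, assuming a simple
# document split and a character-overlap scorer in place of the paper's
# (unspecified here) verifier. All names below are illustrative.

from difflib import SequenceMatcher
from typing import Callable, List, Tuple


def make_next_segment_example(document: str, boundary: int) -> Tuple[str, str]:
    """Split a pre-training document into (preceding context, gold next segment)."""
    return document[:boundary], document[boundary:]


def next_segment_reward(prediction: str, gold_segment: str) -> float:
    """Score how closely the policy's predicted segment matches the gold segment.

    A character-level similarity ratio stands in for whatever scoring
    mechanism RLPT actually uses to reward accurate next-segment prediction.
    """
    return SequenceMatcher(None, prediction.strip(), gold_segment.strip()).ratio()


def rollout_rewards(
    policy_generate: Callable[[str], str],  # maps a prompt to a sampled continuation
    documents: List[str],
    boundary: int = 512,
) -> List[float]:
    """Collect rewards for one batch of rollouts; these would feed an RL update."""
    rewards = []
    for doc in documents:
        context, gold = make_next_segment_example(doc, boundary)
        prompt = f"{context}\n\nContinue the text with the next segment:"
        prediction = policy_generate(prompt)
        rewards.append(next_segment_reward(prediction, gold))
    return rewards
```

Because the reward is computed purely from the pre-training document itself, no human annotation is needed; the rewards returned by a loop like this could then drive any standard policy-gradient update.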