PretrainZero: Reinforcement Active Pretraining
December 3, 2025
Authors: Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang
cs.AI
Abstract
Mimicking human behavior to actively learn from general experience and achieve artificial general intelligence has long been a human dream. Recent reinforcement learning (RL) based large thinking models demonstrate impressive expert-level abilities, e.g., in software development and mathematics, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus that extends RL from domain-specific post-training to general pretraining. PretrainZero has the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy that actively identifies reasonable and informative content in the pretraining corpus and reasons to predict that content via RL. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking through the verification data wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and math average benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
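The self-supervised masked-span idea described above can be illustrated with a small, self-contained sketch. The snippet below is an assumption for illustration only, not the paper's actual recipe: the names make_span_task and span_reward are hypothetical, and the exact-match/token-overlap reward is just one plausible verifier-free signal for grading a predicted span against the held-out text.

```python
# Illustrative sketch (assumed, not the paper's implementation): turn a plain
# passage into a masked-span prediction task and score a prediction with a
# self-supervised reward that needs no external verifier or reward model.
import random
from dataclasses import dataclass


@dataclass
class SpanTask:
    prompt: str   # passage with one contiguous span replaced by a mask token
    target: str   # the held-out span the policy must reconstruct


def make_span_task(passage: str, min_len: int = 3, max_len: int = 8,
                   mask_token: str = "<MASK>") -> SpanTask:
    """Mask a random contiguous word span to form a prediction task."""
    words = passage.split()
    span_len = random.randint(min_len, min(max_len, max(min_len, len(words) // 4)))
    start = random.randint(0, len(words) - span_len)
    target = " ".join(words[start:start + span_len])
    prompt = " ".join(words[:start] + [mask_token] + words[start + span_len:])
    return SpanTask(prompt=prompt, target=target)


def span_reward(prediction: str, target: str) -> float:
    """Verifier-free reward: 1.0 for an exact reconstruction, otherwise
    partial credit from token overlap (an F1-style score)."""
    if prediction.strip().lower() == target.strip().lower():
        return 1.0
    pred_tokens, tgt_tokens = prediction.lower().split(), target.lower().split()
    if not pred_tokens or not tgt_tokens:
        return 0.0
    overlap = len(set(pred_tokens) & set(tgt_tokens))
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(tgt_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    passage = ("PretrainZero extends reinforcement learning from domain-specific "
               "post-training to general pretraining on the Wikipedia corpus.")
    task = make_span_task(passage)
    print("Prompt :", task.prompt)
    print("Target :", task.target)
    # A policy model would generate the prediction here; we reuse the target
    # purely to show that an exact reconstruction earns the maximum reward.
    print("Reward :", span_reward(task.target, task.target))
```

Under this reading, "verification scaling" would correspond to curriculum-style control of how hard the masked spans are (for example, longer or more information-dense spans), with the overlap reward standing in for the self-supervised signal; the actual span-selection policy and reward in PretrainZero may differ.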