
**PretrainZero: Reinforcement Active Pretraining**

December 3, 2025
作者: Xingrun Xing, Zhiyuan Fan, Jie Lou, Guoqi Li, Jiajun Zhang, Debing Zhang
cs.AI

Abstract

Mimicking human behavior to actively learn from general experience and achieve artificial general intelligence has long been a human dream. Recent reinforcement learning (RL) based large-thinking models demonstrate impressive expert-level abilities, e.g., in software and math, but still rely heavily on verifiable rewards in specific domains, creating a significant bottleneck for extending the performance boundary of general reasoning capabilities. In this work, we propose PretrainZero, a reinforcement active learning framework built on the pretraining corpus that extends RL from domain-specific post-training to general pretraining. PretrainZero has the following characteristics: 1) Active pretraining: inspired by the active learning ability of humans, PretrainZero learns a unified reasoning policy to actively identify reasonable and informative content in the pretraining corpus and to predict that content through RL-driven reasoning. 2) Self-supervised learning: without any verifiable labels, pretrained reward models, or supervised fine-tuning, we directly pretrain reasoners from 3B to 30B base models on the general Wikipedia corpus using RL, significantly breaking the verification data wall for general reasoning. 3) Verification scaling: by tackling increasingly challenging masked spans, PretrainZero substantially enhances the general reasoning abilities of pretrained base models. In reinforcement pretraining, PretrainZero improves Qwen3-4B-Base by 8.43, 5.96, and 10.60 points on MMLU-Pro, SuperGPQA, and averaged math benchmarks, respectively. In post-training, the pretrained models can also serve as reasoning foundation models for downstream RLVR tasks.
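To make the described loop concrete, below is a minimal, purely illustrative Python sketch of masked-span reinforcement pretraining as the abstract outlines it: select an informative span from a corpus document, mask it, have the policy reason to predict it, and score the prediction against the original span as a self-supervised reward. All names here (`select_informative_span`, `policy_generate`, `span_reward`) are hypothetical stand-ins, not the authors' implementation; the toy selector and generator replace the learned reasoning policy and the RL update is omitted.

```python
# Hypothetical sketch of a PretrainZero-style loop. The span selector and
# generator below are toy stand-ins for the learned reasoning policy.
import random
import re

CORPUS = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "The French Revolution began in 1789 and reshaped European politics.",
]

def select_informative_span(text):
    """Toy span selector: pick a longer content word as the span to mask.
    The paper learns span selection with a unified reasoning policy; here
    we approximate 'informative' with a crude word-length filter."""
    words = re.findall(r"[A-Za-z]+", text)
    candidates = [w for w in words if len(w) > 5]
    return random.choice(candidates)

def policy_generate(masked_text):
    """Stand-in for the policy's prediction of the masked span. A real
    implementation would sample a chain of thought plus a final answer
    from the base model being pretrained."""
    return random.choice(["glucose", "energy", "politics", "Revolution"])

def span_reward(prediction, target):
    """Self-supervised verification: the masked-out span itself serves as
    the label, so no external verifier or reward model is needed."""
    return 1.0 if prediction.lower() == target.lower() else 0.0

def pretrain_step(text):
    target = select_informative_span(text)
    masked = text.replace(target, "[MASK]", 1)
    prediction = policy_generate(masked)
    reward = span_reward(prediction, target)
    # A real step would feed this reward into a policy-gradient update
    # (e.g., PPO-style) on the base model; omitted in this sketch.
    return masked, prediction, target, reward

for doc in CORPUS:
    print(pretrain_step(doc))
```

Under this framing, "verification scaling" would correspond to scheduling progressively harder spans (longer, rarer, or more reasoning-dependent) as training advances, so the reward signal keeps pace with the model's growing ability.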