

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

October 7, 2025
Authors: Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
cs.AI

Abstract

Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100× fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
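The abstract describes the pipeline only at a high level: pre-training documents are converted into question-answer pairs whose answers can be checked automatically. The sketch below is a minimal illustration of that idea, not the authors' implementation; the `propose_qa` and `solve` callables stand in for whatever generator and checker models the actual pipeline uses, and the exact-match verification is an assumed, simplified stand-in for the paper's "verifiable" filter.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class QAPair:
    domain: str          # e.g. science, finance, code (one of the 9+ domains)
    question: str        # question grounded in the source document
    answer: str          # short reference answer that can be checked automatically
    source_excerpt: str  # span of the pre-training document supporting the answer


def build_rl_examples(
    documents: Iterable[dict],                      # each item: {"text": ..., "domain": ...}
    propose_qa: Callable[[str], Optional[QAPair]],  # placeholder: generator model call
    solve: Callable[[str], str],                    # placeholder: independent checker model call
) -> Iterator[QAPair]:
    """Turn raw pre-training documents into verifiable QA pairs for RL.

    A pair is kept only if an independent checker reproduces the reference
    answer, which approximates the verifiability requirement described in
    the abstract.
    """
    for doc in documents:
        pair = propose_qa(doc["text"])
        if pair is None:
            # Document yielded no usable, answerable question; skip it.
            continue
        predicted = solve(pair.question)
        if predicted.strip().lower() == pair.answer.strip().lower():
            # Verified: the answer is short, checkable, and reproducible.
            yield pair
```

In this simplified view, scaling to pre-training levels amounts to streaming web-scale document collections through `build_rl_examples` and using the surviving pairs as reward-checkable prompts for RL training.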