

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

October 7, 2025
Authors: Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
cs.AI

Abstract

Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pretraining with up to 100x fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.
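
As a rough illustration of the pipeline concept described in the abstract (not the authors' implementation), the sketch below shows how a single pre-training document might be converted into verifiable question-answer pairs: draft candidate pairs, then keep only those whose answers can be checked against the source. All names here are illustrative assumptions (`document_to_qa`, `QAPair`, the `generate` and `verify` callables, and the toy stand-ins); in the actual Webscale-RL pipeline these roles would be filled by LLM-based generation and verification stages.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class QAPair:
    question: str
    answer: str
    domain: str


def document_to_qa(
    document: str,
    domain: str,
    generate: Callable[[str], List[Tuple[str, str]]],
    verify: Callable[[str, str, str], bool],
) -> List[QAPair]:
    """Convert one pre-training document into verifiable QA pairs.

    `generate` stands in for a model prompted to draft question-answer
    candidates grounded in the document; `verify` stands in for a checker
    that keeps only pairs whose answers can be confirmed against the source.
    """
    pairs: List[QAPair] = []
    for question, answer in generate(document):
        if verify(document, question, answer):
            pairs.append(QAPair(question, answer, domain))
    return pairs


# Toy stand-ins so the sketch runs without any model access.
def toy_generate(doc: str) -> List[Tuple[str, str]]:
    first_sentence = doc.split(".")[0]
    return [("What does the document state first?", first_sentence)]


def toy_verify(doc: str, question: str, answer: str) -> bool:
    return answer.strip() != "" and answer in doc


if __name__ == "__main__":
    doc = (
        "Reinforcement learning optimizes policies from reward signals. "
        "It differs from imitation learning."
    )
    print(document_to_qa(doc, domain="machine-learning",
                         generate=toy_generate, verify=toy_verify))
```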