RePro: Training Language Models to Faithfully Recycle the Web for Pretraining
October 12, 2025
Authors: Zichun Yu, Chenyan Xiong
cs.AI
Abstract
High-quality pretraining data is the fossil fuel of large language models
(LLMs), yet its reserves are running low for frontier models. In this paper, we
introduce RePro, a novel web recycling method that trains a relatively small LM
with reinforcement learning to generate effective and faithful rephrasings of
pretraining data. Specifically, we design one quality reward and three
faithfulness rewards, optimizing the LM rephraser to convert organic data into
high-quality rephrasings while maintaining its core semantics and structure. In
our experiments, we train a 4B rephraser to recycle 72B tokens sampled from
DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that
RePro delivers 4.7%-14.0% relative accuracy gains over the organic-only baseline on
22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web
recycling method that prompts a 70B rephraser, as well as the organic baseline
with a 4x larger data pool. Experiments with different amounts of recycled data
highlight that RePro improves organic data efficiency by 2-3x. Individual and
distributional analyses validate that RePro preserves more critical information
and faithfully reflects the characteristics of organic data compared to
prompting-based methods. Together, these results show that RePro provides an
efficient and controllable path to effectively harness the fossil fuel of LLM
pretraining. We open-source our code, rephraser, and recycled data at
https://github.com/cxcscmu/RePro.
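
To make the reward design named in the abstract concrete, below is a minimal illustrative sketch of combining one quality reward with three faithfulness rewards into a single scalar for RL. Every scoring function here is a toy lexical proxy, and the function names, the choice of faithfulness terms, and the equal weighting are assumptions for illustration only; they are not RePro's actual reward models (see the paper and repository for those).

```python
# Toy sketch: one quality reward plus three faithfulness rewards, averaged and
# weighted into a single scalar reward for an RL-trained rephraser.
# All scorers below are illustrative proxies, not RePro's implementation.

def quality_reward(rephrased: str) -> float:
    # Toy proxy for a learned data-quality scorer: favor multi-sentence text.
    sentences = [s for s in rephrased.split(".") if s.strip()]
    return min(len(sentences) / 10.0, 1.0)

def semantic_faithfulness(organic: str, rephrased: str) -> float:
    # Toy proxy for semantic preservation: unigram overlap with the original.
    a = set(organic.lower().split())
    b = set(rephrased.lower().split())
    return len(a & b) / max(len(a), 1)

def structural_faithfulness(organic: str, rephrased: str) -> float:
    # Toy proxy for structural preservation: keep paragraph counts comparable.
    pa = organic.count("\n\n") + 1
    pb = rephrased.count("\n\n") + 1
    return min(pa, pb) / max(pa, pb)

def length_faithfulness(organic: str, rephrased: str) -> float:
    # Toy proxy: penalize large deviations in word count.
    la, lb = len(organic.split()), len(rephrased.split())
    return min(la, lb) / max(la, lb, 1)

def total_reward(organic: str, rephrased: str,
                 w_quality: float = 1.0, w_faithful: float = 1.0) -> float:
    # One quality term plus the mean of three faithfulness terms; the equal
    # weighting is an assumption, not taken from the paper.
    faithful = (semantic_faithfulness(organic, rephrased)
                + structural_faithfulness(organic, rephrased)
                + length_faithfulness(organic, rephrased)) / 3.0
    return w_quality * quality_reward(rephrased) + w_faithful * faithful

if __name__ == "__main__":
    organic = "The page explains gradient descent.\n\nIt then gives an example."
    rephrased = "This page describes gradient descent and walks through an example."
    print(f"reward = {total_reward(organic, rephrased):.3f}")
```

In an RL setup of this kind, the scalar returned by total_reward would serve as the episode reward for each rephrasing sampled from the LM; the hard part the paper addresses is designing the real quality and faithfulness scorers so that optimizing this signal yields rephrasings that stay true to the organic data.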