

RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

October 12, 2025
作者: Zichun Yu, Chenyan Xiong
cs.AI

Abstract

High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiments, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over the organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.
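The abstract describes optimizing the rephraser against one quality reward and three faithfulness rewards. The paper's exact reward formulas and weights are not given here, so the following is only a minimal illustrative sketch of how such a composite reward signal might be blended for an RL objective; the function name, the specific faithfulness dimensions, and the weights are all hypothetical assumptions.

```python
# Hypothetical sketch: blend one quality reward with three faithfulness
# rewards into a single scalar for policy optimization. The reward names,
# weighting scheme, and [0, 1] score ranges are illustrative assumptions,
# not the paper's actual formulation.

def combined_reward(quality: float, faithfulness: list[float],
                    weights: tuple[float, float] = (0.5, 0.5)) -> float:
    """Blend a quality score with the mean of faithfulness scores.

    quality: scalar in [0, 1] from a data-quality scorer.
    faithfulness: three scalars in [0, 1], e.g. semantic, structural,
        and informational fidelity between the organic text and its
        rephrasing (placeholder dimensions for illustration).
    """
    w_q, w_f = weights
    # Average the faithfulness components, then take a weighted sum
    # with the quality score to form the final reward.
    return w_q * quality + w_f * sum(faithfulness) / len(faithfulness)
```

A rephrasing that scores perfectly on all components would receive the maximum reward, while one that boosts quality at the expense of fidelity to the organic text would be penalized through the faithfulness term.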