

Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

September 26, 2025
作者: Syeda Nahida Akter, Shrimai Prabhumoye, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Yejin Choi, Bryan Catanzaro
cs.AI

Abstract

The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly incorporated during the mid-training stage as well (a practice that is relatively more proprietary and less openly characterized), the role of such data in pretraining remains unclear. In particular, because the pretraining corpora of most frontier models are opaque, the effect of reasoning data introduced at different phases of pretraining and/or post-training is rarely reported in the scientific literature. This raises several important questions: Is adding reasoning data earlier, during pretraining, any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varied in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% average gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% average gain), while SFT is more sensitive to data quality (15% average gain). We show that high-quality pretraining data has latent effects that are activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.
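
To make the experimental design described above concrete, the sketch below enumerates the kind of condition grid such a study implies: reasoning data varied in scale, diversity, and quality, injected at either the pretraining or the SFT stage. This is a minimal illustration of ours, not code or naming from the paper; `Condition`, the bucket labels, and `experiment_grid` are all hypothetical.

```python
# Illustrative sketch (hypothetical, not from the paper): the cross-product
# of experimental conditions implied by the abstract -- where reasoning data
# is injected, and how its scale, diversity, and quality are varied.
from dataclasses import dataclass
from itertools import product

STAGES = ("pretraining", "sft")        # training stage receiving reasoning data
SCALES = ("small", "medium", "large")  # hypothetical data-scale buckets
DIVERSITY = ("narrow", "broad")        # breadth of reasoning patterns covered
QUALITY = ("mixed", "high")            # strictness of the quality filter


@dataclass(frozen=True)
class Condition:
    """One experimental configuration in the hypothetical study grid."""
    stage: str
    scale: str
    diversity: str
    quality: str


def experiment_grid():
    """Yield every (stage, scale, diversity, quality) combination."""
    for stage, scale, diversity, quality in product(
        STAGES, SCALES, DIVERSITY, QUALITY
    ):
        yield Condition(stage, scale, diversity, quality)


if __name__ == "__main__":
    for condition in experiment_grid():
        print(condition)
```

Under the paper's asymmetric-allocation finding, the best-performing cells in a grid like this would be broad-diversity conditions at the pretraining stage and high-quality conditions at the SFT stage.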