Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

September 26, 2025
作者: Syeda Nahida Akter, Shrimai Prabhumoye, Eric Nyberg, Mostofa Patwary, Mohammad Shoeybi, Yejin Choi, Bryan Catanzaro
cs.AI

Abstract

The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly also incorporated during the mid-training stage (a practice that is relatively more proprietary and less openly characterized), the role of such data in pretraining remains unclear. In particular, because the pretraining corpora of most frontier models are opaque, the effect of reasoning data introduced at different phases of pre- and/or post-training is underreported in the scientific literature. This raises several important questions: Is adding reasoning data earlier, during pretraining, any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varied in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% average gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% average gain), while SFT is more sensitive to data quality (15% average gain). We show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.
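
To make the asymmetric allocation principle concrete, here is a minimal, hypothetical Python sketch (not from the paper) of how one might select reasoning data for each stage: a diversity-first sampler for the pretraining mix and a quality-first filter for the SFT mix. The Example record, its pattern tags, and the quality scores are all illustrative assumptions, standing in for whatever taxonomy and scoring a real pipeline would use.

import random
from dataclasses import dataclass

@dataclass
class Example:
    text: str
    pattern: str    # hypothetical reasoning-pattern tag, e.g. "deduction"
    quality: float  # hypothetical quality score in [0, 1]

def sample_pretraining_mix(pool: list[Example], n: int) -> list[Example]:
    # Pretraining: maximize coverage of reasoning patterns.
    # Round-robin over per-pattern buckets so every pattern is represented.
    buckets: dict[str, list[Example]] = {}
    for ex in pool:
        buckets.setdefault(ex.pattern, []).append(ex)
    for bucket in buckets.values():
        random.shuffle(bucket)
    mix: list[Example] = []
    patterns = list(buckets)
    while len(mix) < n and any(buckets.values()):
        for p in patterns:
            if buckets[p] and len(mix) < n:
                mix.append(buckets[p].pop())
    return mix

def sample_sft_mix(pool: list[Example], n: int) -> list[Example]:
    # SFT: prioritize quality by keeping only the top-scoring examples,
    # and cap n rather than scaling it naively.
    return sorted(pool, key=lambda ex: ex.quality, reverse=True)[:n]

if __name__ == "__main__":
    pool = [
        Example(f"ex{i}",
                random.choice(["deduction", "induction", "arithmetic"]),
                random.random())
        for i in range(1000)
    ]
    pretrain = sample_pretraining_mix(pool, 100)  # broad pattern coverage
    sft = sample_sft_mix(pool, 100)               # quality-filtered subset
    print(len({ex.pattern for ex in pretrain}), min(ex.quality for ex in sft))

The cap on the SFT sample size echoes the abstract's warning that naively scaling SFT data can wash away the benefits of reasoning data injected during pretraining.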