Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Abstract

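The prevailing paradigm for improving the reasoning abilities of large language models (LLMs) focuses on post-training with high-quality, reasoning-intensive data. Recent work suggests that reasoning data is increasingly incorporated during the mid-training stage as well, but this practice is relatively proprietary and little about it is published. In particular, because the pretraining corpora of frontier models are opaque, there are relatively few scientific reports on the effect of reasoning data introduced at different stages of pretraining and post-training. This raises several important questions: Is adding reasoning data early in pretraining better than introducing it during post-training? Does early inclusion risk overfitting and harm generalization, or does it establish a solid foundation that later fine-tuning cannot recover? In this work, we conduct the first systematic study of how reasoning data that varies in scale, diversity, and quality affects LLM performance when introduced at different stages of training. We find that introducing reasoning data early, during pretraining, is critical (19% average gain), establishing foundational capabilities that later-stage SFT (Supervised Fine-Tuning) cannot fully replicate, even with more data. We also uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% average gain), whereas SFT is more sensitive to data quality (15% average gain). Furthermore, we show that high-quality pretraining data has latent effects that are activated only after SFT, and that naively scaling up SFT data can be counterproductive, canceling out the benefits of early reasoning injection. These results challenge the conventional view that separates language modeling from reasoning and provide principled guidance for strategically allocating data across the entire training pipeline to build more capable models.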

English

The prevailing paradigm for enhancing the reasoning abilities of LLMs revolves around post-training on high-quality, reasoning-intensive data. While emerging literature suggests that reasoning data is increasingly incorporated during the mid-training stage as well (a practice that is relatively more proprietary and less openly characterized), the role of such data in pretraining remains unclear. In particular, due to the opacity of pretraining corpora in most frontier models, the effect of reasoning data introduced at different phases of pre- and/or post-training is relatively underreported in the scientific literature. This raises several important questions: Is adding reasoning data earlier during pretraining any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% average gain), establishing foundational capabilities that cannot be fully replicated by later-stage SFT, even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% average gain), while SFT is more sensitive to data quality (15% average gain). We show that high-quality pretraining data has latent effects, activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning, providing a principled guide for strategically allocating data across the entire training pipeline to build more capable models.

Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data
