Front-Loading Reasoning: The Synergy between Pretraining and Post-Training Data

Abstract

The prevailing paradigm for improving the reasoning abilities of large language models (LLMs) centers on post-training with high-quality, reasoning-intensive data. Emerging literature suggests that reasoning data is increasingly incorporated during mid-training as well, a practice that is relatively more proprietary and less openly characterized, yet the role of such data in pretraining remains unclear. In particular, because the pretraining corpora of most frontier models are opaque, the effect of reasoning data introduced at different phases of pre- and/or post-training is relatively underreported in the scientific literature. This raises several important questions: Is adding reasoning data earlier, during pretraining, any better than introducing it during post-training? Could earlier inclusion risk overfitting and harm generalization, or does it instead establish durable foundations that later fine-tuning cannot recover? We conduct the first systematic study of how reasoning data, varying in scale, diversity, and quality, affects LLM performance when introduced at different stages of training. We find that front-loading reasoning data into pretraining is critical (19% average gain), establishing foundational capabilities that cannot be fully replicated by later-stage supervised fine-tuning (SFT), even with more data. We uncover an asymmetric principle for optimal data allocation: pretraining benefits most from broad diversity in reasoning patterns (11% average gain), while SFT is more sensitive to data quality (15% average gain). We show that high-quality pretraining data has latent effects that are activated only after SFT, and that naively scaling SFT data can be detrimental, washing away the benefits of early reasoning injection. Our results challenge the conventional separation of language modeling and reasoning and provide a principled guide for strategically allocating data across the entire training pipeline to build more capable models.
