ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

Abstract

Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process, in which insights from post-training retroactively improve the pre-trained foundation, remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, without requiring a specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training and uses high-quality corpora under a rapidly decaying learning rate. Building on this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks spanning math, code, and general reasoning, and retains gains of over 2% throughout the post-training pipeline. These results validate an iterative feedback loop that enables continuous, self-reinforcing evolution of LLMs.
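The abstract does not give the weighting rule, so the following is only a minimal sketch of what token-level reweighting of a mid-training loss could look like. The scoring rule (the log-probability gap between an RL-tuned model and the base model), the clipping bounds, and the names `token_weights` and `reweighted_nll` are all assumptions made for illustration, not the method described by the authors.

```python
import math


def token_weights(rl_logprobs, base_logprobs, temperature=1.0,
                  w_min=0.5, w_max=2.0):
    """Hypothetical per-token weights (NOT the paper's formula).

    Assumption: a token that the RL-tuned model finds more likely than
    the base model does is treated as more important for reasoning.
    """
    # Log-probability gap between the RL-tuned model and the base model.
    gaps = [r - b for r, b in zip(rl_logprobs, base_logprobs)]
    # Turn gaps into positive scores, then clip so no token dominates.
    raw = [math.exp(g / temperature) for g in gaps]
    clipped = [min(max(w, w_min), w_max) for w in raw]
    # Normalize to mean 1 so the overall loss scale stays comparable.
    mean = sum(clipped) / len(clipped)
    return [w / mean for w in clipped]


def reweighted_nll(base_logprobs, weights):
    """Weighted mean negative log-likelihood over the tokens of a sequence."""
    total = sum(w * lp for w, lp in zip(weights, base_logprobs))
    return -total / len(base_logprobs)


# Toy sequence of two tokens: the RL-tuned model strongly prefers token 0.
rl_lp = [-0.1, -2.0]
base_lp = [-1.0, -2.0]
weights = token_weights(rl_lp, base_lp)
loss = reweighted_nll(base_lp, weights)
```

With uniform weights the function reduces to the ordinary mean negative log-likelihood, so a scheme of this kind can be read as a drop-in modification of the standard next-token objective during the annealing phase.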
