Reinforcement Mid-Training

Abstract

The development of state-of-the-art large language models is commonly understood as a two-stage process of pre-training followed by post-training. We argue for an additional intermediate stage, reinforcement mid-training, which offers potential for strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training caused by excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training with several new components. First, we introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens while fully exploiting all token information. Extensive experiments show that RMT outperforms state-of-the-art methods, achieving up to +64.91% performance improvement in language modeling with only 21% of the reasoning length. We also show that checkpoints obtained after reinforcement mid-training benefit subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
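
To make the notions of token entropy and an easy-to-hard curriculum concrete, the following is a minimal sketch and not the RMT implementation. The abstract does not specify how entropy is computed or how tokens are ranked, so the sketch assumes that "easy" means low Shannon entropy of the next-token distribution and that the admitted fraction of tokens grows linearly with training progress. The function names `token_entropy` and `curriculum_select` and the 50%-to-100% schedule are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def curriculum_select(entropies, progress):
    """Return the positions of tokens admitted at training progress in [0, 1].

    Assumption (not from the paper): tokens are ranked from lowest to highest
    entropy, and the admitted fraction grows linearly from 50% to 100%.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    # Rank token positions from lowest entropy ("easy") to highest ("hard").
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    keep = max(1, math.ceil(len(order) * (0.5 + 0.5 * progress)))
    # Return the admitted positions in their original sequence order.
    return sorted(order[:keep])


# Toy next-token distributions over a 4-word vocabulary, one per position.
distributions = [
    [1.00, 0.00, 0.00, 0.00],  # fully confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain prediction
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
]
entropies = [token_entropy(p) for p in distributions]
early = curriculum_select(entropies, progress=0.0)
late = curriculum_select(entropies, progress=1.0)
```

Under these assumptions, early training sees only the low-entropy positions, and the high-entropy positions are added as `progress` approaches 1. A real system would compute entropies from model logits and would likely use a different schedule.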
