Reinforcement Mid-Training

Abstract

The development of state-of-the-art large language models is commonly understood as a two-stage process of pre-training followed by post-training. We argue for an additional intermediate stage, reinforcement mid-training, which has the potential to deliver strong performance gains. In this paper, we formally define the problem and identify three key challenges: (1) inefficient training due to excessive reasoning steps, (2) disregard of the imbalanced token entropy distribution, and (3) underutilization of token information. To address these challenges, we propose RMT, a framework for efficient, adaptive, and unified reinforcement mid-training. First, we introduce a dynamic token budget mechanism that constrains unnecessary reasoning steps and mitigates model overthinking. Next, we design a curriculum-based adaptive sampling method that fosters a progressive learning trajectory from easy to hard tokens. Finally, we present a dual training strategy that combines reinforcement learning with next-token prediction, ensuring targeted learning on key tokens while fully exploiting the information in all tokens. Extensive experiments show that RMT outperforms state-of-the-art methods, achieving up to +64.91% performance improvement in language modeling while using only 21% of the reasoning length. We also show that checkpoints obtained after reinforcement mid-training benefit subsequent post-training, yielding up to +18.76% improvement in the mathematical domain.
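
To make the roles of token entropy, curriculum sampling, and the dual objective more concrete, the following is a minimal illustrative sketch, not the RMT implementation. Everything beyond the Shannon entropy formula is an assumption made for illustration: the function names `curriculum_select` and `dual_loss`, the sliding-window schedule, the treatment of low-entropy tokens as "easy", and the rule that selected tokens receive a reinforcement learning loss while the remaining tokens receive a next-token prediction loss.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def curriculum_select(entropies, progress, frac=0.5):
    """Choose token positions for RL at a given training progress in [0, 1].

    Hypothetical schedule: positions are ranked by entropy (low entropy is
    treated as easy), and a window covering `frac` of the positions slides
    from the easy end (progress=0) to the hard end (progress=1).
    """
    n = len(entropies)
    k = max(1, int(frac * n))
    order = sorted(range(n), key=lambda i: entropies[i])
    start = int(progress * (n - k))
    return sorted(order[start:start + k])


def dual_loss(ntp_losses, rl_losses, selected):
    """Average per-token loss: RL loss on selected positions, NTP elsewhere.

    This mixing rule is an assumption used to illustrate combining the
    two objectives; the per-token losses are supplied by the caller.
    """
    chosen = set(selected)
    total = sum(
        rl_losses[i] if i in chosen else ntp_losses[i]
        for i in range(len(ntp_losses))
    )
    return total / len(ntp_losses)


if __name__ == "__main__":
    entropies = [0.1, 1.5, 0.7, 2.0]
    early = curriculum_select(entropies, progress=0.0)
    late = curriculum_select(entropies, progress=1.0)
    print("early positions:", early)
    print("late positions:", late)
    print("mixed loss:", dual_loss([1.0] * 4, [2.0] * 4, early))
```

In this sketch, early training concentrates the reinforcement learning signal on low-entropy positions and later training shifts it toward high-entropy positions, while every position still contributes to the loss through one of the two objectives.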