OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Abstract

Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing the next generation of RL-scalable foundation models. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long CoT improves reasoning depth, it can also induce verbose model responses and unstable RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models that demonstrates strong RL compatibility and narrows the performance gap with more RL-friendly model families such as Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (i.e., MegaMath-Web-Pro-Max).
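As a rough illustration of the Stable-then-Decay schedule described above, the following minimal sketch maps the number of tokens seen to a learning rate for a single branch. Only the 200B-token constant stage and the 20B-token decay stage come from the abstract; the peak learning rate, the cosine decay shape, the final learning-rate ratio, and the function name `stable_then_decay_lr` are assumptions made for illustration and are not taken from the paper.

```python
import math

STABLE_TOKENS = 200_000_000_000  # constant-LR stage length (from the abstract)
DECAY_TOKENS = 20_000_000_000    # decay stage length (from the abstract)
PEAK_LR = 3e-5                   # assumption: illustrative peak learning rate
MIN_LR_RATIO = 0.1               # assumption: final LR as a fraction of the peak


def stable_then_decay_lr(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens.

    Stage 1 holds the learning rate constant at PEAK_LR.
    Stage 2 decays it to PEAK_LR * MIN_LR_RATIO; the cosine shape
    is an assumption, since the abstract only says "learning rate decay".
    """
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR
    # Fraction of the decay stage completed, clamped to [0, 1].
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    min_lr = PEAK_LR * MIN_LR_RATIO
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (PEAK_LR - min_lr) * cosine


if __name__ == "__main__":
    for billions in (0, 100, 200, 205, 210, 215, 220):
        lr = stable_then_decay_lr(billions * 1_000_000_000)
        print(f"{billions:>3}B tokens: lr = {lr:.2e}")
```

In this sketch each CoT-focused branch would start from the same stable-stage checkpoint and run its own decay stage on a different data mixture; that reading of the branching is likewise an assumption.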
