OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling

Abstract

Different base language model families, such as Llama and Qwen, exhibit divergent behaviors during post-training with reinforcement learning (RL), especially on reasoning-intensive tasks. What makes a base language model suitable for reinforcement learning? Gaining deeper insight into this question is essential for developing next-generation RL-scalable foundation models. In this work, we investigate how mid-training strategies shape RL dynamics, focusing on two representative model families: Qwen and Llama. Our study reveals that (1) high-quality mathematical corpora, such as MegaMath-Web-Pro, significantly improve both base model and RL performance, while existing alternatives (e.g., FineMath-4plus) fail to do so; (2) further adding QA-style data, particularly long chain-of-thought (CoT) reasoning examples, enhances RL outcomes, and instruction data further unlocks this effect; (3) while long CoT improves reasoning depth, it can also induce verbose model responses and unstable RL training, underscoring the importance of data formatting; (4) scaling mid-training consistently leads to stronger downstream RL performance. Building on these insights, we introduce a two-stage mid-training strategy, Stable-then-Decay, in which base models are first trained on 200B tokens with a constant learning rate, followed by 20B tokens across three CoT-focused branches with learning rate decay. This yields OctoThinker, a family of models that demonstrates strong RL compatibility and narrows the performance gap with more RL-friendly model families such as Qwen. We hope our work will help shape pre-training strategies for foundation models in the RL era. To support further research, we release our open-source models along with a curated math reasoning-intensive corpus of over 70 billion tokens (MegaMath-Web-Pro-Max).
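
The Stable-then-Decay schedule described above can be illustrated with a minimal sketch. Only the token budgets (200B constant, then 20B with decay) come from the abstract. The function name `stable_then_decay_lr`, the peak and final learning rates, and the cosine decay shape are assumptions chosen for illustration and may differ from what the authors used.

```python
import math

# Token budgets stated in the abstract.
STABLE_TOKENS = 200_000_000_000  # stage 1: constant learning rate
DECAY_TOKENS = 20_000_000_000    # stage 2: learning rate decay (per branch)


def stable_then_decay_lr(tokens_seen, peak_lr=1e-4, final_lr=1e-5):
    """Return the learning rate after `tokens_seen` training tokens.

    peak_lr, final_lr, and the cosine shape are hypothetical values
    picked for this sketch; the abstract does not specify them.
    """
    if tokens_seen <= STABLE_TOKENS:
        # Stable stage: hold the learning rate constant.
        return peak_lr
    # Decay stage: fraction of the decay budget consumed, capped at 1.
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    # Cosine interpolation from peak_lr down to final_lr.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine
```

Under these assumptions, each of the three CoT-focused branches would start from the same stage-1 checkpoint and run its own copy of the decay stage on a different data mixture.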
