SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild

Abstract

DeepSeek-R1 has shown that long chain-of-thought (CoT) reasoning can emerge naturally through a simple reinforcement learning (RL) framework with rule-based rewards, where training can start directly from the base models, a paradigm referred to as zero RL training. Most recent efforts to reproduce zero RL training have focused on the Qwen2.5 model series, which may not be representative, as we find that these base models already exhibit strong instruction-following and self-reflection abilities. In this work, we investigate zero RL training across 10 diverse base models spanning different families and sizes, including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and all Qwen2.5 models from 0.5B to 32B. Leveraging several key design strategies, such as adjusting the format reward and controlling query difficulty, we achieve substantial improvements in both reasoning accuracy and response length across most settings. However, by carefully monitoring the training dynamics, we observe that different base models exhibit distinct patterns during training. For instance, increased response length does not always correlate with the emergence of certain cognitive behaviors such as verification (i.e., the "aha moment"). Notably, we observe the "aha moment" for the first time in small models not from the Qwen family. We share the key designs that enable successful zero RL training, along with our findings and practices. To facilitate further research, we open-source the code, models, and analysis tools.
