MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

Abstract

Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models (RLMs). Among these domains, mathematical reasoning serves as a representative benchmark because it requires precise multi-step logic and abstract reasoning that generalize to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Many open-source projects aim to close this gap, but most omit critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: supervised fine-tuning (SFT) on a carefully curated corpus of 719K math-reasoning problems with verified chain-of-thought (CoT) trajectories, followed by reinforcement learning with verifiable rewards (RLVR) on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our models achieve state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.
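The abstract names two ingredients of the RL stage, length-progressive training and a repetition penalty, without giving their exact form. The following is only a minimal sketch of how such a reward might be assembled, not the authors' implementation. The stage caps in `STAGE_MAX_TOKENS`, the n-gram measure in `repetition_score`, the `penalty_weight` value, and the exact-match answer check are all hypothetical choices made for illustration, and a simple proportional penalty stands in for the adaptive one described above.

```python
from collections import Counter

# Hypothetical stage schedule: maximum response length (in tokens) per RL stage.
# The abstract does not state the actual values.
STAGE_MAX_TOKENS = [16_000, 32_000, 48_000]


def max_tokens_for_stage(stage: int) -> int:
    """Return the response-length cap for a stage, clamping to the last stage."""
    return STAGE_MAX_TOKENS[min(stage, len(STAGE_MAX_TOKENS) - 1)]


def repetition_score(tokens: list[str], n: int = 4) -> float:
    """Fraction of n-grams that repeat an earlier n-gram (0.0 means no repetition)."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    repeated = sum(c - 1 for c in counts.values())
    return repeated / total


def reward(tokens: list[str], predicted: str, reference: str,
           stage: int, penalty_weight: float = 0.5) -> float:
    """Verifiable correctness reward minus a proportional repetition penalty.

    Assumption of this sketch: a response longer than the stage cap is
    treated as truncated and receives no correctness credit.
    """
    within_cap = len(tokens) <= max_tokens_for_stage(stage)
    correct = 1.0 if within_cap and predicted.strip() == reference.strip() else 0.0
    return correct - penalty_weight * repetition_score(tokens)
```

In this sketch, raising `stage` over the course of training loosens the length cap, so early stages favor short, correct responses while later stages allow longer reasoning traces; the penalty term discourages responses that fill the available context by looping.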
