MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization
July 19, 2025
Authors: Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing
cs.AI
Abstract
Large language models have recently evolved from fluent text generation to
advanced reasoning across diverse domains, giving rise to reasoning language
models (RLMs). Among these domains, mathematical reasoning serves as a representative
benchmark as it requires precise multi-step logic and abstract reasoning, which
can be generalized to other tasks. While closed-source RLMs such as GPT-o3
demonstrate impressive reasoning capabilities, their proprietary nature limits
transparency and reproducibility. Although many open-source projects aim to
close this gap, most of them lack sufficient openness by omitting critical
resources such as datasets and detailed training configurations, which hinders
reproducibility. To contribute toward greater transparency in RLM development,
we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on
the Qwen-2.5 backbone that match or exceed the performance of existing
open-source RLMs. Specifically, our models are trained in two stages: supervised
fine-tuning (SFT) on a carefully curated corpus of 719K math-reasoning problems
with verified chain-of-thought (CoT) trajectories, followed by reinforcement
learning with verifiable rewards (RLVR) on 62K challenging and verifiable problems. To
enhance the robustness and efficiency of the RLVR process, we introduce
Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates
length-progressive training with an adaptive repetition penalty to encourage
context-aware RL training. Our models achieve state-of-the-art or competitive
performance and superior token efficiency among Qwen-2.5-based open-source 7B
and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate
reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B,
MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K,
MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope
these resources will support further research and foster community advancement.
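
The abstract describes Context-Aware Multi-Stage Policy Optimization only at a high level. The sketch below is a minimal, purely illustrative toy (not the authors' released implementation) of how a length-progressive context budget and an adaptive repetition penalty could be combined into a single verifiable-reward shaping function; all names (repetition_rate, shaped_reward, stage_max_tokens, repetition_penalty_weight) and the n-gram-based penalty are assumptions for illustration.

```python
# Illustrative sketch only: a toy reward shaper in the spirit of the
# length-progressive training and adaptive repetition penalty described above.
# Names and the exact penalty form are hypothetical, not the paper's API.

from collections import Counter
from typing import List


def repetition_rate(tokens: List[str], ngram: int = 4) -> float:
    """Fraction of n-grams in the response that are duplicates."""
    if len(tokens) < ngram:
        return 0.0
    ngrams = [tuple(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)]
    counts = Counter(ngrams)
    duplicated = sum(c - 1 for c in counts.values())
    return duplicated / len(ngrams)


def shaped_reward(is_correct: bool,
                  tokens: List[str],
                  stage_max_tokens: int,
                  repetition_penalty_weight: float = 0.5) -> float:
    """Verifiable correctness reward, gated by the current stage's context
    budget and reduced by a penalty that grows with n-gram repetition."""
    # Length-progressive training: responses beyond the current stage budget
    # receive no positive reward, so concise solutions are learned first.
    if len(tokens) > stage_max_tokens:
        return 0.0
    base = 1.0 if is_correct else 0.0
    penalty = repetition_penalty_weight * repetition_rate(tokens)
    return max(base - penalty, 0.0)


if __name__ == "__main__":
    # Hypothetical stage schedule: the context budget expands as RL progresses.
    stage_budgets = [4096, 8192, 16384]
    sample = "the answer is 42 . the answer is 42 .".split()
    for budget in stage_budgets:
        print(budget, shaped_reward(True, sample, budget))
```

In such a scheme, early stages with tight budgets reward short correct solutions, later stages relax the budget to allow longer chains of thought, and the repetition term discourages degenerate looping as responses grow; the actual CAMPO algorithm and hyperparameters are specified in the paper and released training configurations.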