

MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

July 19, 2025
Authors: Xingxuan Li, Yao Xiao, Dianwen Ng, Hai Ye, Yue Deng, Xiang Lin, Bin Wang, Zhanfeng Mo, Chong Zhang, Yueyi Zhang, Zonglin Yang, Ruilin Li, Lei Lei, Shihao Xu, Han Zhao, Weiling Chen, Feng Ji, Lidong Bing
cs.AI

Abstract

Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models (RLMs). Among these domains, mathematical reasoning serves as a representative benchmark because it requires precise multi-step logic and abstract reasoning that generalize to other tasks. While closed-source RLMs such as GPT-o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most lack sufficient openness, omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: supervised fine-tuning (SFT) on a carefully curated corpus of 719K math-reasoning problems with verified chain-of-thought (CoT) trajectories, followed by reinforcement learning with verifiable rewards (RLVR) on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our models achieve state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B); datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K); and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.
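To make the two mechanisms the abstract attributes to Context-Aware Multi-Stage Policy Optimization more concrete, the following is a minimal, hypothetical Python sketch of a length-progressive generation budget that grows over RL stages and an adaptive repetition penalty folded into a verifiable outcome reward. All names, schedule values, and the n-gram-based penalty heuristic are illustrative assumptions, not the authors' released implementation (see the repository and paper for the actual algorithm).

```python
# Illustrative sketch only; hypothetical names and values, not the authors' code.

# Assumed length-progressive schedule: the maximum completion length
# allowed during RL grows as training moves to later stages.
LENGTH_SCHEDULE = [8_192, 16_384, 32_768]  # tokens per stage (assumed values)

def max_len_for_stage(stage: int) -> int:
    """Return the generation budget for a given RL stage (clamped to the last stage)."""
    return LENGTH_SCHEDULE[min(stage, len(LENGTH_SCHEDULE) - 1)]

def repetition_penalty(token_ids: list[int], n: int = 4, max_penalty: float = 0.5) -> float:
    """Adaptive penalty: scales with the fraction of repeated n-grams in the rollout."""
    if len(token_ids) < n:
        return 0.0
    ngrams = [tuple(token_ids[i:i + n]) for i in range(len(token_ids) - n + 1)]
    repeat_ratio = 1.0 - len(set(ngrams)) / len(ngrams)  # 0 = no repeats, 1 = all repeats
    return max_penalty * repeat_ratio

def shaped_reward(token_ids: list[int], is_correct: bool) -> float:
    """Verifiable outcome reward (answer checked against ground truth) minus the penalty."""
    return (1.0 if is_correct else 0.0) - repetition_penalty(token_ids)
```

In this sketch, a degenerate rollout that loops on the same phrases earns a lower reward even when its final answer verifies, while the stage-dependent budget keeps early training on short contexts and only later allows long reasoning traces.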