MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

May 12, 2025
Authors: Xiaomi LLM-Core Team, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
cs.AI

Abstract

We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
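The test-difficulty-driven code reward mentioned above targets the sparse-reward problem in RL on programming tasks: instead of a binary pass/fail signal per problem, the policy earns partial credit that scales with the difficulty of the test cases its solution passes. The sketch below is a minimal illustration of that idea only; the grouping of tests by difficulty, the weighted-average scoring, and all names (`TestGroup`, `difficulty_weighted_reward`) are assumptions for exposition, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TestGroup:
    """A group of unit tests that share roughly the same difficulty."""
    difficulty: float                 # assumed weight: higher = harder test group
    tests: List[Callable[[], bool]]   # each test returns True if the candidate passes it


def difficulty_weighted_reward(groups: List[TestGroup]) -> float:
    """Dense reward in [0, 1]: partial credit for each test group the candidate
    solution passes, weighted by group difficulty, rather than an all-or-nothing
    pass/fail signal. Illustrative only; not the paper's exact formulation."""
    total_weight = sum(g.difficulty for g in groups)
    if total_weight == 0.0:
        return 0.0
    score = sum(
        g.difficulty * sum(test() for test in g.tests) / len(g.tests)
        for g in groups
    )
    return score / total_weight


# Toy usage with a hypothetical candidate solution `add`.
def add(a: int, b: int) -> int:
    return a + b


groups = [
    TestGroup(difficulty=1.0, tests=[lambda: add(1, 2) == 3]),              # easy cases
    TestGroup(difficulty=3.0, tests=[lambda: add(-5, 5) == 0,
                                     lambda: add(10**6, 1) == 10**6 + 1]),  # harder cases
]
print(difficulty_weighted_reward(groups))  # -> 1.0 when every test passes
```

A reward shaped this way keeps gradient signal flowing even when a policy only solves the easier test cases of a hard problem, which is the sparse-reward issue the abstract says the scheme is meant to alleviate.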
