MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

Abstract

We present MiMo-7B, a large language model designed for reasoning tasks and optimized across both the pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrate a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues, and employ strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at https://github.com/xiaomimimo/MiMo.
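The abstract does not spell out how the test-difficulty-driven code reward is computed, so the following is only a minimal sketch of the general idea under stated assumptions: each test case carries a hypothetical difficulty weight, and a solution earns the share of total weight covered by the tests it passes. The function names, the weights, and the normalization are illustrative and are not taken from the paper. The point of the comparison is that an all-or-nothing reward returns zero for any partially correct program, while a weighted reward still gives a graded signal.

```python
from typing import Sequence


def binary_reward(passed: Sequence[bool]) -> float:
    """All-or-nothing reward: 1.0 only if every test case passes."""
    return 1.0 if all(passed) else 0.0


def difficulty_weighted_reward(
    passed: Sequence[bool], difficulty: Sequence[float]
) -> float:
    """Partial-credit reward (assumed form, not the paper's exact scheme).

    Returns the fraction of total difficulty weight that belongs to
    the test cases the solution passed.
    """
    if len(passed) != len(difficulty):
        raise ValueError("passed and difficulty must have the same length")
    total = sum(difficulty)
    if total <= 0:
        return 0.0
    earned = sum(w for ok, w in zip(passed, difficulty) if ok)
    return earned / total


# Hypothetical example: two easier tests pass, the hardest one fails.
results = [True, True, False]
weights = [1.0, 2.0, 5.0]  # assumed difficulty weights

sparse = binary_reward(results)
dense = difficulty_weighted_reward(results, weights)
```

In this hypothetical case the binary reward is 0.0 while the weighted reward is 3.0 / 8.0, so a policy that solves the easier tests still receives a learning signal.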
