MiMo-VL Technical Report

Abstract

We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.
