

MiMo-VL Technical Report

June 4, 2025
Authors: Xiaomi LLM-Core Team, Zihao Yue, Zhenru Lin, Yifan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia
cs.AI

Abstract

We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning. MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35 out of 40 evaluated tasks, and scores 59.4 on OlympiadBench, surpassing models with up to 78B parameters. For GUI grounding applications, it sets a new standard with 56.1 on OSWorld-G, even outperforming specialized models such as UI-TARS. Our training combines four-stage pre-training (2.4 trillion tokens) with Mixed On-policy Reinforcement Learning (MORL) integrating diverse reward signals. We identify the importance of incorporating high-quality reasoning data with long Chain-of-Thought into pre-training stages, and the benefits of mixed RL despite challenges in simultaneous multi-domain optimization. We also contribute a comprehensive evaluation suite covering 50+ tasks to promote reproducibility and advance the field. The model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-VL.