MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
March 10, 2025
作者: Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao
cs.AI
Abstract
We present MM-Eureka, a multimodal reasoning model that successfully extends
large-scale rule-based reinforcement learning (RL) to multimodal reasoning.
While rule-based RL has shown remarkable success in improving LLMs' reasoning
abilities in text domains, its application to multimodal settings has remained
challenging. Our work reproduces key characteristics of text-based RL systems
like DeepSeek-R1 in the multimodal space, including steady increases in
accuracy reward and response length, and the emergence of reflection behaviors.
We demonstrate that both instruction-tuned and pre-trained models can develop
strong multimodal reasoning capabilities through rule-based RL without
supervised fine-tuning, showing superior data efficiency compared to
alternative approaches. We open-source our complete pipeline to foster further
research in this area. We release all our code, models, data, etc. at
https://github.com/ModalMinds/MM-EUREKA.
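
The abstract refers to a rule-based accuracy reward, i.e. a verifiable reward computed by a fixed rule rather than a learned reward model. The sketch below illustrates one common form of such a reward for math-style tasks, assuming the model's final answer is wrapped in \boxed{...}; the helper names `extract_boxed_answer` and `accuracy_reward` are hypothetical and are not taken from the MM-EUREKA codebase.

```python
import re


def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Binary rule-based reward: 1.0 for an exact answer match, else 0.0."""
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0


# Example usage (illustrative): a correct final answer earns the full reward.
print(accuracy_reward("Reasoning... so the answer is \\boxed{42}", "42"))  # 1.0
print(accuracy_reward("I think the answer is 7", "42"))                    # 0.0
```

Because the reward is a deterministic rule over the output string, it needs no reward model and cannot be reward-hacked in the usual sense, which is one reason this style of RL scales well; the exact matching and answer-extraction rules used by MM-Eureka may differ from this sketch.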