MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning
March 10, 2025
作者: Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao
cs.AI
Abstract
We present MM-Eureka, a multimodal reasoning model that successfully extends
large-scale rule-based reinforcement learning (RL) to multimodal reasoning.
While rule-based RL has shown remarkable success in improving LLMs' reasoning
abilities in text domains, its application to multimodal settings has remained
challenging. Our work reproduces key characteristics of text-based RL systems
like DeepSeek-R1 in the multimodal space, including steady increases in
accuracy reward and response length, and the emergence of reflection behaviors.
We demonstrate that both instruction-tuned and pre-trained models can develop
strong multimodal reasoning capabilities through rule-based RL without
supervised fine-tuning, showing superior data efficiency compared to
alternative approaches. We open-source our complete pipeline to foster further
research in this area. We release all our code, models, data, etc. at
https://github.com/ModalMinds/MM-EUREKA.
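
The abstract refers to a rule-based accuracy reward, i.e. a verifiable reward computed by a fixed rule rather than a learned reward model. The sketch below illustrates one common form of such a reward for math-style tasks, assuming the model's final answer is wrapped in \boxed{...}; the helper names `extract_boxed_answer` and `accuracy_reward` are hypothetical and are not taken from the MM-EUREKA codebase.

```python
import re


def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...} in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def accuracy_reward(response: str, ground_truth: str) -> float:
    """Binary rule-based reward: 1.0 for an exact answer match, else 0.0."""
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0
    return 1.0 if answer == ground_truth.strip() else 0.0


# Example usage (illustrative): a correct final answer earns the full reward.
print(accuracy_reward("Reasoning... so the answer is \\boxed{42}", "42"))  # 1.0
print(accuracy_reward("I think the answer is 7", "42"))                    # 0.0
```

Because the reward is a deterministic rule over the output string, it needs no reward model and cannot be reward-hacked in the usual sense, which is one reason this style of RL scales well; the exact matching and answer-extraction rules used by MM-Eureka may differ from this sketch.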