

MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning

March 10, 2025
作者: Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, Ping Luo, Yu Qiao, Qiaosheng Zhang, Wenqi Shao
cs.AI

Abstract

We present MM-Eureka, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to multimodal reasoning. While rule-based RL has shown remarkable success in improving LLMs' reasoning abilities in text domains, its application to multimodal settings has remained challenging. Our work reproduces key characteristics of text-based RL systems such as DeepSeek-R1 in the multimodal space, including steady increases in accuracy reward and response length, and the emergence of reflection behaviors. We demonstrate that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning, showing superior data efficiency compared to alternative approaches. We open-source our complete pipeline to foster further research in this area. We release all our code, models, and data at https://github.com/ModalMinds/MM-EUREKA.
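
The abstract does not include implementation details, but the "rule-based" reward it refers to can be illustrated with a minimal sketch: a deterministic check of the model's final answer against a ground-truth label, with no learned reward model. All names below are hypothetical and not taken from the MM-EUREKA repository, and the \boxed{...} answer convention is an assumption borrowed from common R1-style setups, not a documented detail of this paper.

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Hypothetical rule-based accuracy reward: 1.0 if the extracted final
    answer matches the ground truth exactly, 0.0 otherwise."""
    # Assumes the model is prompted to wrap its final answer in \boxed{...},
    # a common convention in R1-style RL pipelines (an assumption here, not
    # a documented MM-EUREKA detail).
    match = re.search(r"\\boxed\{([^{}]*)\}", response)
    if match is None:
        return 0.0  # unparseable responses receive no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Example usage: a correct boxed answer earns the full reward.
print(accuracy_reward(r"... so the area is \boxed{42}", "42"))  # 1.0
print(accuracy_reward("no boxed answer here", "42"))            # 0.0
```

Because the reward is a fixed rule rather than a trained model, it cannot be reward-hacked in the way learned reward models can, which is part of why this style of RL scales; the tradeoff is that it only applies to tasks with verifiable answers.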
