

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

June 2, 2025
作者: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan
cs.AI

Abstract

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B, demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
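
To make the GRPO component concrete, the sketch below shows how a group-relative advantage could be combined with a reflection-aware reward. This is a minimal illustration only: the abstract says the reward encourages concise, cognitively meaningful reflection while avoiding redundancy, but it does not specify the reward terms, so the `reflection_aware_reward` function, its correctness/reflection bonuses, and the `target_len` and `len_penalty` parameters here are assumptions, not SRPO's actual design.

```python
# Sketch: GRPO group-relative advantages with a hypothetical reflection-aware reward.
# The reward terms below are illustrative assumptions, not the paper's reward.
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Rollout:
    answer_correct: bool        # final answer matches the reference
    contains_reflection: bool   # response includes a self-reflection segment
    reflection_tokens: int      # length of that reflection segment


def reflection_aware_reward(r: Rollout,
                            target_len: int = 128,
                            len_penalty: float = 0.001) -> float:
    """Hypothetical reward: correctness, plus a small bonus for reflecting,
    minus a penalty for reflections that run past a target length."""
    reward = 1.0 if r.answer_correct else 0.0
    if r.contains_reflection:
        reward += 0.2  # assumed bonus for producing any self-reflection
        excess = max(0, r.reflection_tokens - target_len)
        reward -= len_penalty * excess  # discourage redundant, overlong reflections
    return reward


def grpo_advantages(group: List[Rollout]) -> List[float]:
    """Group-relative advantages as in GRPO: each rollout's reward is
    normalized by the mean and std of rewards within its sampled group."""
    rewards = [reflection_aware_reward(r) for r in group]
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(rw - mu) / sigma for rw in rewards]


if __name__ == "__main__":
    group = [
        Rollout(answer_correct=True, contains_reflection=True, reflection_tokens=90),
        Rollout(answer_correct=False, contains_reflection=True, reflection_tokens=400),
        Rollout(answer_correct=True, contains_reflection=False, reflection_tokens=0),
    ]
    print(grpo_advantages(group))
```

In practice these advantages would weight the policy-gradient update for each sampled response; the point of the sketch is only that a single scalar reward can fold in both answer correctness and a conciseness-aware reflection signal before the group-relative normalization.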