

SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning

June 2, 2025
作者: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan
cs.AI

Abstract

Multimodal large language models (MLLMs) have shown promising capabilities in reasoning tasks, yet still struggle with complex problems requiring explicit self-reflection and self-correction, especially compared to their unimodal text-based counterparts. Existing reflection methods are simplistic and struggle to generate meaningful and instructive feedback, as the reasoning ability and knowledge limits of pre-trained models are largely fixed during initial training. To overcome these challenges, we propose Multimodal Self-Reflection enhanced reasoning with Group Relative Policy Optimization (SRPO), a two-stage reflection-aware reinforcement learning (RL) framework explicitly designed to enhance multimodal LLM reasoning. In the first stage, we construct a high-quality, reflection-focused dataset under the guidance of an advanced MLLM, which generates reflections based on initial responses to help the policy model learn both reasoning and self-reflection. In the second stage, we introduce a novel reward mechanism within the GRPO framework that encourages concise and cognitively meaningful reflection while avoiding redundancy. Extensive experiments across multiple multimodal reasoning benchmarks, including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B and Qwen-2.5-VL-32B, demonstrate that SRPO significantly outperforms state-of-the-art models, achieving notable improvements in both reasoning accuracy and reflection quality.
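
To make the GRPO component concrete, the sketch below shows how a group-relative advantage could be combined with a reflection-aware reward. This is a minimal illustration only: the abstract says the reward encourages concise, cognitively meaningful reflection while avoiding redundancy, but it does not specify the reward terms, so the `reflection_aware_reward` function, its correctness/reflection bonuses, and the `target_len` and `len_penalty` parameters here are assumptions, not SRPO's actual design.

```python
# Sketch: GRPO group-relative advantages with a hypothetical reflection-aware reward.
# The reward terms below are illustrative assumptions, not the paper's reward.
from dataclasses import dataclass
from typing import List
import statistics


@dataclass
class Rollout:
    answer_correct: bool        # final answer matches the reference
    contains_reflection: bool   # response includes a self-reflection segment
    reflection_tokens: int      # length of that reflection segment


def reflection_aware_reward(r: Rollout,
                            target_len: int = 128,
                            len_penalty: float = 0.001) -> float:
    """Hypothetical reward: correctness, plus a small bonus for reflecting,
    minus a penalty for reflections that run past a target length."""
    reward = 1.0 if r.answer_correct else 0.0
    if r.contains_reflection:
        reward += 0.2  # assumed bonus for producing any self-reflection
        excess = max(0, r.reflection_tokens - target_len)
        reward -= len_penalty * excess  # discourage redundant, overlong reflections
    return reward


def grpo_advantages(group: List[Rollout]) -> List[float]:
    """Group-relative advantages as in GRPO: each rollout's reward is
    normalized by the mean and std of rewards within its sampled group."""
    rewards = [reflection_aware_reward(r) for r in group]
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(rw - mu) / sigma for rw in rewards]


if __name__ == "__main__":
    group = [
        Rollout(answer_correct=True, contains_reflection=True, reflection_tokens=90),
        Rollout(answer_correct=False, contains_reflection=True, reflection_tokens=400),
        Rollout(answer_correct=True, contains_reflection=False, reflection_tokens=0),
    ]
    print(grpo_advantages(group))
```

In practice these advantages would weight the policy-gradient update for each sampled response; the point of the sketch is only that a single scalar reward can fold in both answer correctness and a conciseness-aware reflection signal before the group-relative normalization.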