SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
June 2, 2025
作者: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan
cs.AI
Abstract
Multimodal large language models (MLLMs) have shown promising capabilities in
reasoning tasks, yet still struggle with complex problems requiring explicit
self-reflection and self-correction, especially compared to their unimodal
text-based counterparts. Existing reflection methods are simplistic and
struggle to generate meaningful and instructive feedback, as the reasoning
ability and knowledge limits of pre-trained models are largely fixed during
initial training. To overcome these challenges, we propose Multimodal
Self-Reflection enhanced reasoning with Group Relative Policy Optimization
(SRPO), a two-stage reflection-aware reinforcement learning (RL) framework
explicitly designed to enhance multimodal LLM reasoning. In the first stage, we
construct a high-quality, reflection-focused dataset under the guidance of an
advanced MLLM, which generates reflections based on initial responses to help
the policy model learn both reasoning and self-reflection. In the second stage,
we introduce a novel reward mechanism within the GRPO framework that encourages
concise and cognitively meaningful reflection while avoiding redundancy.
Extensive experiments across multiple multimodal reasoning benchmarks,
including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B
and Qwen-2.5-VL-32B, demonstrate that SRPO significantly outperforms
state-of-the-art models, achieving notable improvements in both reasoning
accuracy and reflection quality.
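The abstract describes GRPO's group-relative advantage estimation and a reward that favors concise, meaningful reflection, but gives no formulas. Below is a minimal illustrative sketch, assuming the standard GRPO normalization (each sampled response's reward minus the group mean, divided by the group standard deviation) and a hypothetical brevity bonus standing in for the paper's unspecified reflection reward; `reflection_reward`, `target_len`, and `alpha` are illustrative names and hyperparameters, not from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def reflection_reward(correct, reflection_tokens, target_len=128, alpha=0.1):
    """Hypothetical reward shaping (not from the paper): a base accuracy
    reward plus a bonus that decays once the reflection exceeds a target
    length, discouraging redundant self-reflection."""
    base = 1.0 if correct else 0.0
    excess = max(0, reflection_tokens - target_len)
    brevity = max(0.0, 1.0 - excess / target_len)
    return base + alpha * brevity

# One prompt, a group of four sampled responses:
# (answer correct?, reflection length in tokens)
group = [(True, 90), (False, 300), (True, 200), (False, 60)]
rewards = [reflection_reward(c, n) for c, n in group]
print(grpo_advantages(rewards))
```

In GRPO, these normalized advantages would then weight the policy-gradient update (typically alongside a KL penalty toward a reference model), so responses that answer correctly with concise reflections are reinforced relative to the rest of their group.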