SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning
June 2, 2025
作者: Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, Yangfan He, Mi Zhang, Shen Yan
cs.AI
Abstract
Multimodal large language models (MLLMs) have shown promising capabilities in
reasoning tasks, yet still struggle with complex problems requiring explicit
self-reflection and self-correction, especially compared to their unimodal
text-based counterparts. Existing reflection methods are simplistic and
struggle to generate meaningful and instructive feedback, as the reasoning
ability and knowledge limits of pre-trained models are largely fixed during
initial training. To overcome these challenges, we propose Multimodal
Self-Reflection enhanced reasoning with Group Relative Policy Optimization
(SRPO), a two-stage reflection-aware reinforcement learning (RL) framework
explicitly designed to enhance multimodal LLM reasoning. In the first stage, we
construct a high-quality, reflection-focused dataset under the guidance of an
advanced MLLM, which generates reflections based on initial responses to help
the policy model learn both reasoning and self-reflection. In the second stage,
we introduce a novel reward mechanism within the GRPO framework that encourages
concise and cognitively meaningful reflection while avoiding redundancy.
Extensive experiments across multiple multimodal reasoning benchmarks,
including MathVista, MathVision, MathVerse, and MMMU-Pro, using Qwen-2.5-VL-7B
and Qwen-2.5-VL-32B, demonstrate that SRPO significantly outperforms
state-of-the-art models, achieving notable improvements in both reasoning
accuracy and reflection quality.
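The abstract describes GRPO's group-relative advantage estimation and a reward that favors concise, meaningful reflection, but gives no formulas. Below is a minimal illustrative sketch, assuming the standard GRPO normalization (each sampled response's reward minus the group mean, divided by the group standard deviation) and a hypothetical brevity bonus standing in for the paper's unspecified reflection reward; `reflection_reward`, `target_len`, and `alpha` are illustrative names and hyperparameters, not from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the group's mean and standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

def reflection_reward(correct, reflection_tokens, target_len=128, alpha=0.1):
    """Hypothetical reward shaping (not from the paper): a base accuracy
    reward plus a bonus that decays once the reflection exceeds a target
    length, discouraging redundant self-reflection."""
    base = 1.0 if correct else 0.0
    excess = max(0, reflection_tokens - target_len)
    brevity = max(0.0, 1.0 - excess / target_len)
    return base + alpha * brevity

# One prompt, a group of four sampled responses:
# (answer correct?, reflection length in tokens)
group = [(True, 90), (False, 300), (True, 200), (False, 60)]
rewards = [reflection_reward(c, n) for c, n in group]
print(grpo_advantages(rewards))
```

In GRPO, these normalized advantages would then weight the policy-gradient update (typically alongside a KL penalty toward a reference model), so responses that answer correctly with concise reflections are reinforced relative to the rest of their group.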