R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
May 22, 2025
Authors: Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
cs.AI
Abstract
In this work, we aim to incentivize the reasoning ability of Multimodal Large
Language Models (MLLMs) via reinforcement learning (RL) and develop an
effective approach that mitigates the sparse-reward and advantage-vanishing
issues during RL. To this end, we propose Share-GRPO, a novel RL approach that
tackles these issues by exploring and sharing diverse reasoning trajectories
over an expanded question space. Specifically, Share-GRPO first expands the
question space for a given question via data transformation techniques, then
encourages the MLLM to effectively explore diverse reasoning trajectories over
the expanded question space, and shares the discovered reasoning trajectories
across the expanded questions during RL. In addition, Share-GRPO shares
reward information during advantage computation, estimating solution
advantages hierarchically across and within question variants, which allows
more accurate estimation of relative advantages and improves the stability of
policy training. Extensive evaluations over six widely used reasoning
benchmarks demonstrate the superior performance of our method. Code will be
available at https://github.com/HJYao00/R1-ShareVL.
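
The abstract does not spell out the exact advantage formulas, so the following is only a minimal sketch of what a hierarchical, GRPO-style advantage estimate across and within question variants could look like. The function name `hierarchical_advantages`, the mixing weight `alpha`, and the specific combination of the two terms are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hierarchical_advantages(rewards_per_variant, eps=1e-6, alpha=0.5):
    """Toy two-level advantage estimate (assumed, not the paper's exact method).

    rewards_per_variant: list of 1-D arrays; rewards_per_variant[k] holds the
    rewards of all rollouts associated with question variant k (shared
    trajectories contribute a reward entry to every variant they answer).
    Returns one array of advantages per variant, aligned with the inputs.
    """
    all_rewards = np.concatenate(rewards_per_variant)
    global_mean, global_std = all_rewards.mean(), all_rewards.std()
    variant_means = np.array([r.mean() for r in rewards_per_variant])

    advantages = []
    for k, r in enumerate(rewards_per_variant):
        # Within-variant term: how a rollout compares to other rollouts of the
        # same question variant (standard GRPO-style group normalization).
        within = (r - r.mean()) / (r.std() + eps)
        # Across-variant term: how this variant's reward level compares to all
        # variants derived from the original question.
        across = (variant_means[k] - global_mean) / (global_std + eps)
        advantages.append(alpha * within + (1 - alpha) * across)
    return advantages
```

In this sketch, sharing rollouts across variants simply means the same trajectory's reward appears in several of the per-variant reward arrays, which is one plausible reading of how shared trajectories and shared reward information could densify the advantage signal.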