R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
May 22, 2025
Authors: Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
cs.AI
Abstract
In this work, we aim to incentivize the reasoning ability of Multimodal Large
Language Models (MLLMs) via reinforcement learning (RL) and develop an
effective approach that mitigates the sparse-reward and advantage-vanishing
issues during RL. To this end, we propose Share-GRPO, a novel RL approach that
tackles these issues by exploring and sharing diverse reasoning trajectories
over an expanded question space. Specifically, Share-GRPO first expands the
question space for a given question via data transformation techniques, then
encourages the MLLM to effectively explore diverse reasoning trajectories over
the expanded question space, and shares the discovered reasoning trajectories
across the expanded questions during RL. In addition, Share-GRPO shares
reward information during advantage computation, estimating solution
advantages hierarchically across and within question variants, which allows
more accurate estimation of relative advantages and improves the stability of
policy training. Extensive evaluations over six widely used reasoning
benchmarks demonstrate the superior performance of our method. Code will be
available at https://github.com/HJYao00/R1-ShareVL.
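
The abstract does not spell out the exact advantage formulas, so the following is only a minimal sketch of what a hierarchical, GRPO-style advantage estimate across and within question variants could look like. The function name `hierarchical_advantages`, the mixing weight `alpha`, and the specific combination of the two terms are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def hierarchical_advantages(rewards_per_variant, eps=1e-6, alpha=0.5):
    """Toy two-level advantage estimate (assumed, not the paper's exact method).

    rewards_per_variant: list of 1-D arrays; rewards_per_variant[k] holds the
    rewards of all rollouts associated with question variant k (shared
    trajectories contribute a reward entry to every variant they answer).
    Returns one array of advantages per variant, aligned with the inputs.
    """
    all_rewards = np.concatenate(rewards_per_variant)
    global_mean, global_std = all_rewards.mean(), all_rewards.std()
    variant_means = np.array([r.mean() for r in rewards_per_variant])

    advantages = []
    for k, r in enumerate(rewards_per_variant):
        # Within-variant term: how a rollout compares to other rollouts of the
        # same question variant (standard GRPO-style group normalization).
        within = (r - r.mean()) / (r.std() + eps)
        # Across-variant term: how this variant's reward level compares to all
        # variants derived from the original question.
        across = (variant_means[k] - global_mean) / (global_std + eps)
        advantages.append(alpha * within + (1 - alpha) * across)
    return advantages
```

In this sketch, sharing rollouts across variants simply means the same trajectory's reward appears in several of the per-variant reward arrays, which is one plausible reading of how shared trajectories and shared reward information could densify the advantage signal.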