R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

May 22, 2025
作者: Huanjin Yao, Qixiang Yin, Jingyi Zhang, Min Yang, Yibo Wang, Wenhao Wu, Fei Su, Li Shen, Minghui Qiu, Dacheng Tao, Jiaxing Huang
cs.AI

Abstract

In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse-reward and advantage-vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to effectively explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation, estimating solution advantages hierarchically across and within question variants; this allows more accurate estimation of relative advantages and improves the stability of policy training. Extensive evaluations on six widely used reasoning benchmarks demonstrate the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
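
The hierarchical advantage estimation described above can be pictured with a minimal sketch. The snippet below normalizes each rollout's reward both within its own question variant and against the pooled rewards of all variants of the expanded question, then combines the two terms. The function name `share_grpo_advantages`, the input layout, and the equal 0.5/0.5 weighting of the two levels are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def share_grpo_advantages(rewards_by_variant, eps=1e-6):
    """Hypothetical two-level advantage estimate in the spirit of Share-GRPO.

    rewards_by_variant: list of 1-D arrays, one per question variant,
    each holding the scalar rewards of the rollouts sampled for that variant.
    Returns a list of advantage arrays with matching shapes.
    """
    # Within-variant term: normalize each rollout against its own variant group.
    within = [(r - r.mean()) / (r.std() + eps) for r in rewards_by_variant]

    # Across-variant term: pool every rollout from the expanded question space
    # and normalize against the pooled statistics (the shared reward information).
    pooled = np.concatenate(rewards_by_variant)
    across = [(r - pooled.mean()) / (pooled.std() + eps) for r in rewards_by_variant]

    # Combine the two levels; the equal weighting here is an assumption.
    return [0.5 * w + 0.5 * a for w, a in zip(within, across)]


# Toy usage: three variants of one question, each with four sampled rollouts.
rewards = [np.array([1.0, 0.0, 1.0, 0.0]),
           np.array([0.0, 0.0, 1.0, 0.0]),
           np.array([1.0, 1.0, 1.0, 0.0])]
print([a.round(2) for a in share_grpo_advantages(rewards)])
```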
