R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

Abstract

In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and to develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques. It then encourages the MLLM to explore diverse reasoning trajectories over the expanded question space and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation: it estimates solution advantages hierarchically, both across and within question variants, which allows more accurate estimation of relative advantages and improves the stability of policy training. Extensive evaluations on six widely used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
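
To make the idea of hierarchical advantage estimation concrete, the sketch below shows one possible reading of it. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the function names `normalize` and `hierarchical_advantages` are hypothetical, rewards are normalized GRPO-style as (reward - mean) / std, once over all responses pooled across question variants and once within each variant, and the two estimates are combined with an equal 0.5 weighting. It is not the authors' implementation.

```python
from statistics import fmean, pstdev


def normalize(rewards, eps=1e-6):
    """GRPO-style group normalization: (r - mean) / (std + eps)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hierarchical_advantages(rewards_by_variant, eps=1e-6):
    """Combine a global (across-variant) and a local (within-variant) estimate.

    rewards_by_variant: one list of scalar rewards per question variant.
    Returns advantages with the same nested shape.
    ASSUMPTION: equal 0.5/0.5 weighting of the two levels; the paper may differ.
    """
    # Global level: normalize over the rewards pooled from all variants.
    pooled = [r for group in rewards_by_variant for r in group]
    global_adv = normalize(pooled, eps)

    advantages = []
    start = 0
    for group in rewards_by_variant:
        # Local level: normalize only within this variant's responses.
        local_adv = normalize(group, eps)
        global_slice = global_adv[start:start + len(group)]
        advantages.append(
            [0.5 * (g + l) for g, l in zip(global_slice, local_adv)]
        )
        start += len(group)
    return advantages


# Variant 0: every response is correct, so within-group advantages vanish.
# Variant 1: a transformed question on which only one response is correct.
rewards = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 1.0, 0.0]]
plain = normalize(rewards[0])           # all zeros: no learning signal
adv = hierarchical_advantages(rewards)  # pooled term restores a signal
```

In this toy case, per-question normalization alone gives zero advantage for every response to variant 0, which is the advantage vanishing issue the abstract mentions. Pooling rewards across variants yields a nonzero signal for those same responses, under the assumptions stated above.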
