R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO

Abstract

In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation: it estimates solution advantages hierarchically, across and within question variants, which allows more accurate estimation of relative advantages and improves the stability of policy training. Extensive evaluations over six widely used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.
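
The idea of estimating advantages "across and within question variants" can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the function names `normalize` and `hierarchical_advantages`, the `local_weight` parameter, the binary rewards, and the choice to add a pooled (across-variant) term to a per-variant (within-variant) term. It is not the authors' implementation.

```python
from statistics import mean, pstdev


def normalize(rewards, eps=1e-6):
    """GRPO-style group normalization: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hierarchical_advantages(rewards_by_variant, local_weight=1.0):
    """Hypothetical two-level advantage estimate (an assumption, not the paper's formula).

    rewards_by_variant[v][k] is the reward of rollout k sampled for question variant v.
    The across-variant term normalizes over the pooled rewards of all variants;
    the within-variant term normalizes inside each variant's own group.
    """
    pooled = [r for group in rewards_by_variant for r in group]
    global_adv = normalize(pooled)

    advantages = []
    offset = 0
    for group in rewards_by_variant:
        local_adv = normalize(group)
        advantages.append(
            [global_adv[offset + k] + local_weight * local_adv[k]
             for k in range(len(group))]
        )
        offset += len(group)
    return advantages


# Three variants of one question, four rollouts each, with binary rewards.
rewards = [
    [1, 1, 1, 1],  # all rollouts correct
    [0, 0, 0, 1],  # mixed outcomes
    [0, 0, 0, 0],  # all rollouts wrong
]

# Per-variant normalization alone gives zero advantage for uniform groups.
per_variant_only = [normalize(group) for group in rewards]

# Pooling rewards across variants restores a non-zero signal for those groups.
shared = hierarchical_advantages(rewards)

print(per_variant_only[0], per_variant_only[2])
print(shared[0], shared[2])
```

In this toy setting, the first and third variants receive zero advantage under per-variant normalization because every rollout in the group has the same reward, which is the advantage vanishing case the abstract refers to. Sharing reward statistics across variants gives those rollouts a positive or negative signal relative to the pooled mean.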