R1-VL：通过逐步分组相对策略优化学习多模态大语言模型的推理能力

摘要

近期研究通常通过在高质量链式思维推理数据上进行监督微调来增强多模态大语言模型（MLLMs）的推理能力，但这往往导致模型仅模仿成功的推理路径，而未能理解错误推理路径的本质。本研究中，我们旨在提升MLLMs的推理能力，使其超越被动模仿正面推理路径的局限。为此，我们设计了逐步组相对策略优化（StepGRPO），这是一种新型在线强化学习框架，通过简单、有效且密集的逐步奖励机制，使MLLMs能够自我提升推理能力。具体而言，StepGRPO引入了两种基于规则的推理奖励：逐步推理准确度奖励（StepRAR）和逐步推理有效性奖励（StepRVR）。StepRAR通过软关键步骤匹配技术，奖励包含必要中间推理步骤的推理路径；而StepRAR则通过推理完整性和逻辑评估策略，奖励遵循结构良好、逻辑一致的推理过程。基于所提出的StepGRPO，我们推出了R1-VL系列MLLMs，该系列模型在逐步推理方面展现出卓越能力。在8个基准测试上的广泛实验验证了我们方法的优越性。

English

Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance the MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards the reasoning paths that contain necessary intermediate reasoning steps via a soft key-step matching technique, while StepRAR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.

R1-VL：通过逐步分组相对策略优化学习多模态大语言模型的推理能力

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

摘要

Support