R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Abstract

Recent studies generally enhance the reasoning capabilities of multimodal large language models (MLLMs) via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding what the wrong reasoning paths are. In this work, we aim to enhance MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve their reasoning ability via simple, effective, and dense step-wise rewards. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: the Step-wise Reasoning Accuracy Reward (StepRAR) and the Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards reasoning paths that contain the necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our methods.
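The abstract names the two step-wise rewards but does not specify them, so the following is only a minimal sketch of how dense rule-based rewards could feed a group-relative advantage. The substring-based key-step match, the "Step"/"Answer:" format check, the 0.1 bonus weights, and all function names are assumptions made for illustration, not the paper's implementation. Only the final normalization (each reward minus the group mean, divided by the group standard deviation) follows the standard group-relative formulation.

```python
from statistics import mean, pstdev


def key_step_reward(response: str, key_steps: list[str], bonus: float = 0.1) -> float:
    """Assumed stand-in for StepRAR: fraction of key steps found in the response."""
    if not key_steps:
        return 0.0
    text = response.lower()
    hits = sum(1 for step in key_steps if step.lower() in text)
    return bonus * hits / len(key_steps)


def validity_reward(response: str, bonus: float = 0.1) -> float:
    """Assumed stand-in for StepRVR: reasoning steps must exist and precede the answer."""
    text = response.lower()
    step_pos = text.find("step")
    answer_pos = text.find("answer:")
    complete = step_pos != -1 and answer_pos != -1
    return bonus if complete and step_pos < answer_pos else 0.0


def total_reward(response: str, gold_answer: str, key_steps: list[str]) -> float:
    """Outcome reward (1 if the final answer matches) plus the two step-wise bonuses."""
    correct = response.strip().lower().endswith(f"answer: {gold_answer}".lower())
    outcome = 1.0 if correct else 0.0
    return outcome + key_step_reward(response, key_steps) + validity_reward(response)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and std of its own sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A toy group of three sampled responses to the same question (hypothetical data).
group = [
    "Step 1: area = 3 * 4 = 12. Step 2: half of 12 is 6. Answer: 6",
    "Answer: 6",
    "Step 1: 3 + 4 = 7. Answer: 7",
]
key_steps = ["3 * 4", "half of 12"]

rewards = [total_reward(r, "6", key_steps) for r in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

In this toy group, the response that reaches the right answer through the expected intermediate steps gets a higher advantage than the one that states the right answer with no reasoning. That is the kind of signal an outcome-only reward cannot provide.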
