R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
March 17, 2025
Authors: Jingyi Zhang, Jiaxing Huang, Huanjin Yao, Shunyu Liu, Xikun Zhang, Shijian Lu, Dacheng Tao
cs.AI
Abstract
Recent studies generally enhance MLLMs' reasoning capabilities via supervised fine-tuning on high-quality chain-of-thought reasoning data, which often leads models to merely imitate successful reasoning paths without understanding why the wrong reasoning paths fail. In this work, we aim to enhance MLLMs' reasoning ability beyond passively imitating positive reasoning paths. To this end, we design Step-wise Group Relative Policy Optimization (StepGRPO), a new online reinforcement learning framework that enables MLLMs to self-improve their reasoning ability via simple, effective, and dense step-wise rewards. Specifically, StepGRPO introduces two novel rule-based reasoning rewards: the Step-wise Reasoning Accuracy Reward (StepRAR) and the Step-wise Reasoning Validity Reward (StepRVR). StepRAR rewards reasoning paths that contain the necessary intermediate reasoning steps via a soft key-step matching technique, while StepRVR rewards reasoning paths that follow a well-structured and logically consistent reasoning process through a reasoning completeness and logic evaluation strategy. With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning. Extensive experiments over 8 benchmarks demonstrate the superiority of our method.
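To make the step-wise rewarding idea concrete, below is a minimal, self-contained sketch of how per-path rewards and group-relative advantages could be computed. It is not the authors' implementation: the soft key-step matching heuristic, the validity check, the equal weighting of the two rewards, and all function names are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of step-wise,
# group-relative reward computation in the spirit of StepGRPO.
# All names, weights, and heuristics here are illustrative assumptions.

from typing import List


def step_accuracy_reward(path_steps: List[str], key_steps: List[str]) -> float:
    """StepRAR-style reward (assumed form): fraction of annotated key
    intermediate steps that are softly matched in the sampled path."""
    if not key_steps:
        return 0.0
    matched = sum(
        any(key.lower() in step.lower() for step in path_steps)  # crude soft match
        for key in key_steps
    )
    return matched / len(key_steps)


def step_validity_reward(path_steps: List[str]) -> float:
    """StepRVR-style reward (assumed form): 1.0 if the path is complete and
    well structured (reasoning steps precede a final answer), else 0.0."""
    has_reasoning = len(path_steps) > 1
    ends_with_answer = bool(path_steps) and path_steps[-1].lower().startswith("answer")
    return 1.0 if (has_reasoning and ends_with_answer) else 0.0


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: normalize each sampled path's reward by the
    mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


# Toy usage: a group of sampled reasoning paths for one question.
group = [
    ["Step 1: area = 3 * 4", "Step 2: divide by 2", "Answer: 6"],
    ["Answer: 6"],
]
key_steps = ["area = 3 * 4", "divide by 2"]
rewards = [
    step_accuracy_reward(p, key_steps) + step_validity_reward(p) for p in group
]
print(group_relative_advantages(rewards))
```

In this toy example the full step-by-step path receives both rewards while the answer-only path receives neither, so the group-relative advantage favors the path with dense, valid intermediate reasoning rather than only a correct final answer.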