SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

May 22, 2025
作者: Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue
cs.AI

Abstract

Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, an attempt to add reward signals for the thinking process to this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers against those leading to incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the weight of the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
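
The reward design described above lends itself to a compact sketch. The following is a minimal illustration, not the authors' implementation: the function name `combined_reward`, the difference-of-means trust weight, and the linear annealing schedule are assumptions made for exposition; the exact Trust-GRPO formulation is given in the paper and repository.

```python
import torch

def combined_reward(
    outcome_reward: torch.Tensor,   # rule-based reward per response (e.g., 1.0 if answer correct, else 0.0)
    thinking_reward: torch.Tensor,  # thinking-reward-model score per response, assumed in [0, 1]
    is_correct: torch.Tensor,       # boolean mask: which responses in the group reached the correct answer
    step: int,                      # current training step
    total_steps: int,               # total training steps, used for annealing
) -> torch.Tensor:
    """Blend the rule-based outcome reward with a trust-weighted, annealed thinking reward."""
    # Trust weight (hypothetical form): if responses with incorrect answers earn
    # thinking rewards as high as those with correct answers, the thinking reward
    # is likely being hacked, so its influence is reduced.
    mean_correct = thinking_reward[is_correct].mean() if is_correct.any() else torch.tensor(0.0)
    mean_incorrect = thinking_reward[~is_correct].mean() if (~is_correct).any() else torch.tensor(0.0)
    trust = torch.clamp(mean_correct - mean_incorrect, min=0.0, max=1.0)

    # Annealing (assumed linear): the thinking reward's contribution decays to
    # zero so that later training relies on the precise outcome reward alone.
    anneal = max(0.0, 1.0 - step / total_steps)

    return outcome_reward + anneal * trust * thinking_reward
```

The shape of the combination matters more than the specific schedule: the trust term gates an auxiliary, model-scored signal by how well it discriminates correct from incorrect responses within a group, and the annealing term hands control back to the exact rule-based reward as training converges.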
