SophiaVL-R1: Reinforcing MLLMs Reasoning with Thinking Reward

May 22, 2025
Authors: Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, Xiangyu Yue
cs.AI

Abstract

Recent advances have shown success in eliciting strong reasoning abilities in multimodal large language models (MLLMs) through rule-based reinforcement learning (RL) with outcome rewards. However, this paradigm typically lacks supervision over the thinking process leading to the final outcome. As a result, the model may learn sub-optimal reasoning strategies, which can hinder its generalization ability. In light of this, we propose SophiaVL-R1, an attempt to add reward signals for the thinking process in this paradigm. To achieve this, we first train a thinking reward model that evaluates the quality of the entire thinking process. Given that the thinking reward may be unreliable for certain samples due to reward hacking, we propose the Trust-GRPO method, which assigns a trustworthiness weight to the thinking reward during training. This weight is computed by comparing the thinking rewards of responses leading to correct answers with those leading to incorrect answers, helping to mitigate the impact of potentially unreliable thinking rewards. Moreover, we design an annealing training strategy that gradually reduces the thinking reward over time, allowing the model to rely more on the accurate rule-based outcome reward in later training stages. Experiments show that SophiaVL-R1 surpasses a series of reasoning MLLMs on various benchmarks (e.g., MathVista, MMMU), demonstrating strong reasoning and generalization capabilities. Notably, SophiaVL-R1-7B even outperforms LLaVA-OneVision-72B on most benchmarks, despite the latter having 10 times more parameters. All code, models, and datasets are made publicly available at https://github.com/kxfan2002/SophiaVL-R1.
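
The abstract describes two mechanisms, Trust-GRPO's trustworthiness weighting and the annealing schedule, without giving formulas here. As a rough illustration only (not the authors' implementation: the linear annealing schedule, the group-level trust weight derived from the mean thinking-reward gap between correct and incorrect responses, and the function name `trust_grpo_rewards` are all assumptions), a combined per-response reward could look like this in Python:

```python
import numpy as np

def trust_grpo_rewards(outcome_rewards, thinking_rewards, step, total_steps):
    """Sketch of a Trust-GRPO-style combined reward for one group of responses.

    outcome_rewards:   rule-based outcome rewards (e.g., 1.0 if the final
                       answer is correct, else 0.0).
    thinking_rewards:  thinking-reward-model scores in [0, 1] for the same
                       responses.
    step, total_steps: current and total training steps, used to anneal the
                       thinking reward away in later training stages.
    """
    outcome_rewards = np.asarray(outcome_rewards, dtype=float)
    thinking_rewards = np.asarray(thinking_rewards, dtype=float)

    correct = outcome_rewards > 0
    incorrect = ~correct

    # Trustworthiness weight (assumed form): if responses with correct final
    # answers get higher thinking rewards than responses with incorrect ones,
    # the thinking reward looks reliable for this sample; otherwise it is
    # likely being hacked and gets down-weighted.
    if correct.any() and incorrect.any():
        gap = thinking_rewards[correct].mean() - thinking_rewards[incorrect].mean()
        trust = float(np.clip(gap, 0.0, 1.0))
    else:
        trust = 1.0  # no correct/incorrect contrast available in this group

    # Annealing (assumed linear schedule): phase out the thinking reward so
    # the model relies more on the precise rule-based outcome reward later.
    anneal = max(0.0, 1.0 - step / total_steps)

    return outcome_rewards + anneal * trust * thinking_rewards
```

With this sketch, `trust_grpo_rewards([1, 0], [0.9, 0.4], step=100, total_steps=1000)` keeps most of the thinking reward early in training, while the same call at `step=900` largely suppresses it; the paper should be consulted for the actual weighting and schedule.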
