

Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

July 22, 2025
Authors: Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen
cs.AI

Abstract

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop slow-thinking ability, because the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL method for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from the trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to the reasoning, and propagates visual rewards backward to the perception step. The LVLM then learns slow-thinking reasoning from the obtained trajectories and propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 at 8B and 38B scales show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
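To make the pipeline in the abstract concrete, the sketch below illustrates how a semi-off-policy rollout with backward visual-reward propagation might be assembled: the trainable LVLM supplies on-policy visual understanding, an external language model supplies off-policy slow-thinking reasoning, and the outcome reward on the reasoning is averaged back onto the perception step. This is a hypothetical illustration only; the function and class names (`lvlm_describe`, `reasoner_solve`, `Trajectory`) and the averaging rule for the visual reward are assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a semi-off-policy rollout as described in the abstract.
# lvlm_describe / reasoner_solve are illustrative stubs, not the authors' API.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    caption: str            # on-policy visual understanding from the trainable LVLM
    reasoning: str          # off-policy slow-thinking reasoning from a language model
    outcome_reward: float   # 1.0 if the reasoning reaches the correct answer, else 0.0
    visual_reward: float    # outcome reward propagated backward to the perception step


def lvlm_describe(image: str, question: str) -> str:
    """Stand-in for the trainable LVLM producing an on-policy image description."""
    return f"description of {image} relevant to: {question}"


def reasoner_solve(caption: str, question: str) -> Tuple[str, str]:
    """Stand-in for an external language model's off-policy slow-thinking reasoning."""
    reasoning = f"step-by-step reasoning over '{caption}' for '{question}'"
    answer = "42"  # placeholder prediction
    return reasoning, answer


def build_semi_off_policy_batch(samples: List[Tuple[str, str, str]],
                                n_reasoning_samples: int = 4) -> List[Trajectory]:
    """Combine on-policy perception with off-policy reasoning and score both parts."""
    batch: List[Trajectory] = []
    for image, question, gold in samples:
        caption = lvlm_describe(image, question)
        outcomes = []
        for _ in range(n_reasoning_samples):
            reasoning, pred = reasoner_solve(caption, question)
            r_outcome = 1.0 if pred.strip() == gold.strip() else 0.0
            outcomes.append((reasoning, r_outcome))
        # Backward visual-reward propagation (assumed rule): credit the caption with the
        # fraction of reasoning samples built on it that reach the correct answer.
        r_visual = sum(r for _, r in outcomes) / len(outcomes)
        for reasoning, r_outcome in outcomes:
            batch.append(Trajectory(caption, reasoning, r_outcome, r_visual))
    return batch


if __name__ == "__main__":
    demo = [("math_diagram.png", "What is the shaded area?", "42")]
    for traj in build_semi_off_policy_batch(demo):
        print(traj.outcome_reward, traj.visual_reward)
```

The resulting trajectories and rewards would then be fed to an off-policy RL update of the LVLM, which is the training stage the abstract refers to but is not sketched here.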