Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning
July 22, 2025
Authors: Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen
cs.AI
Abstract
Enhancing large vision-language models (LVLMs) with visual slow-thinking
reasoning is crucial for solving complex multimodal tasks. However, since LVLMs
are mainly trained with vision-language alignment, it is difficult to adopt
on-policy reinforcement learning (RL) to develop the slow-thinking ability,
because the rollout space is restricted by their initial abilities. Off-policy RL
offers a way to go beyond the current policy, but directly distilling
trajectories from external models may cause visual hallucinations due to
mismatched visual perception abilities across models. To address these issues,
this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for
vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy
behavior model by combining on-policy visual understanding from a trainable
LVLM with off-policy slow-thinking reasoning from a language model, assigns
outcome-based rewards to reasoning, and propagates visual rewards backward.
The LVLM then learns the slow-thinking reasoning ability from the obtained
reasoning trajectories and the propagated rewards via off-policy RL algorithms.
Extensive experiments with InternVL2.5 and InternVL3.0 at the 8B and 38B scales
show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by
8.50% on
average, reaching state-of-the-art performance among open-source LVLMs on
multiple multimodal reasoning benchmarks, and even outperforms some
closed-source models (e.g., GPT-4.1) on the challenging MathVision and
OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively.
Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy
RL methods, offering a better policy initialization for further on-policy
training.
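
To make the rollout scheme described in the abstract concrete, the following minimal Python sketch shows one plausible way to collect semi-off-policy trajectories: the trainable LVLM contributes on-policy visual understanding (a caption), an external language model contributes off-policy slow-thinking reasoning over that caption, each reasoning attempt receives an outcome-based reward, and a visual reward is propagated backward to the caption. The helper names (describe_image, reason_over_text, check_answer) and the averaging rule used for the backward-propagated visual reward are illustrative assumptions, not the paper's actual interface or exact reward design.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Trajectory:
    caption: str             # on-policy visual understanding from the trainable LVLM
    reasoning: str           # off-policy slow-thinking reasoning from an external LM
    reasoning_reward: float  # outcome-based reward on the final answer
    visual_reward: float     # reward propagated backward to the perception step


def collect_rollouts(
    image: str,
    question: str,
    gold_answer: str,
    describe_image: Callable[[str, str], str],    # trainable LVLM: (image, question) -> caption
    reason_over_text: Callable[[str, str], str],  # external LM: (caption, question) -> reasoning
    check_answer: Callable[[str, str], bool],     # verifier: (reasoning, gold) -> correct?
    num_captions: int = 4,
    num_reasonings: int = 4,
) -> List[Trajectory]:
    """Combine on-policy perception with off-policy reasoning and assign rewards."""
    rollouts: List[Trajectory] = []
    for _ in range(num_captions):
        caption = describe_image(image, question)            # on-policy step
        attempts = [reason_over_text(caption, question)      # off-policy steps
                    for _ in range(num_reasonings)]
        outcomes = [1.0 if check_answer(r, gold_answer) else 0.0 for r in attempts]
        # Backward-propagated visual reward: credit a caption by how often the
        # reasoning built on it reaches the correct answer. This is one plausible
        # realization of "propagates visual rewards backward", not the paper's rule.
        visual_reward = mean(outcomes)
        rollouts.extend(
            Trajectory(caption, r, o, visual_reward)
            for r, o in zip(attempts, outcomes)
        )
    return rollouts
```

In a full pipeline, these trajectories and their rewards would then drive an off-policy RL update of the LVLM (for example, an importance-weighted policy-gradient step), which this sketch deliberately omits.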