Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning

Abstract

Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, because LVLMs are mainly trained for vision-language alignment, on-policy reinforcement learning (RL) struggles to develop slow-thinking ability: the rollout space is restricted by the model's initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. The LVLM then learns slow-thinking reasoning from the resulting trajectories and their propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 at 8B and 38B scales show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows that SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
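The abstract does not spell out how rewards are assigned, so the following is only a minimal sketch of one plausible reading of the data flow it describes: captions come from the trainable LVLM (on-policy), reasoning chains come from a language model conditioned on each caption (off-policy), each chain gets an outcome reward, and that reward is propagated back to the caption. The names Trajectory, build_trajectories, toy_reason, and toy_is_correct are hypothetical, and the rule "visual reward = mean outcome reward of the chains conditioned on that caption" is an assumption made for illustration, not something stated in the text.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Trajectory:
    caption: str                # on-policy visual description (trainable LVLM)
    reasoning: str              # off-policy slow-thinking chain (language model)
    outcome_reward: float       # 1.0 if the final answer is judged correct, else 0.0
    visual_reward: float = 0.0  # filled in by backward propagation


def build_trajectories(
    captions: List[str],
    reason: Callable[[str], List[str]],
    is_correct: Callable[[str], bool],
) -> List[Trajectory]:
    """Pair each caption with its reasoning chains and attach both rewards."""
    trajectories: List[Trajectory] = []
    for caption in captions:
        group = [
            Trajectory(caption, chain, 1.0 if is_correct(chain) else 0.0)
            for chain in reason(caption)
        ]
        # ASSUMED propagation rule: a caption is as good as the average
        # outcome of the reasoning chains that were conditioned on it.
        visual = mean(t.outcome_reward for t in group) if group else 0.0
        for t in group:
            t.visual_reward = visual
        trajectories.extend(group)
    return trajectories


# Toy stand-ins for the two models (hypothetical, for illustration only).
def toy_reason(caption: str) -> List[str]:
    if "right triangle" in caption:
        return ["... so the answer is 5", "... so the answer is 5"]
    return ["... so the answer is 7", "... so the answer is 5"]


def toy_is_correct(chain: str) -> bool:
    return chain.endswith("answer is 5")


trajs = build_trajectories(
    ["a right triangle with legs 3 and 4", "a triangle with sides 3 and 4"],
    toy_reason,
    toy_is_correct,
)
for t in trajs:
    print(t.caption, "|", t.outcome_reward, "|", t.visual_reward)
```

In this toy run the more precise caption receives a higher visual reward because every chain built on it reaches the correct answer. An off-policy RL step would then weight the resulting trajectories by these rewards; that update is omitted here because the abstract does not say which algorithm is used.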
