iVGR: Internalizing Visually Grounded Reasoning in MLLMs with Reinforcement Learning

Abstract

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm for enhancing fine-grained perception in multimodal large language models (MLLMs), its efficacy at inference time remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT, and that mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy in which a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks while retaining the flexibility to support tool-assisted inference workflows.