iVGR: Internalizing Visually Grounded Reasoning into Multimodal Large Language Models via Reinforcement Learning

Abstract

Visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm for enhancing fine-grained perception in multimodal large language models (MLLMs), but its efficacy at inference time remains underexplored. In this work, we find empirically that mandating explicit object bounding boxes in visually grounded CoT during inference often degrades performance relative to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that visual localization capability can be internalized into the textual CoT, and that mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy in which a textual stream is aligned with a high-quality visually grounded stream via a consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks while retaining the flexibility to support tool-assisted inference workflows.
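The abstract does not define the consistency reward, so the following is only an illustrative sketch of what a reward aligning a textual stream with a visually grounded stream might look like. Everything here is an assumption made for illustration: the `GroundedRollout` type, the choice to treat "high-quality" as "answered correctly", and the label-mention scoring rule are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GroundedRollout:
    """One sample from the visually grounded stream (hypothetical structure)."""
    labels: List[str]  # names of the objects this rollout drew boxes around
    correct: bool      # whether its final answer matched the ground truth


def consistency_reward(textual_reasoning: str,
                       grounded: List[GroundedRollout]) -> float:
    """Toy consistency score in [0, 1].

    Assumption: a grounded rollout counts as "high-quality" if its answer was
    correct. For each such rollout we measure the fraction of its boxed object
    labels that the textual reasoning mentions, then average over rollouts.
    """
    text = textual_reasoning.lower()
    scores = []
    for rollout in grounded:
        # Skip rollouts that are not high-quality or have nothing to compare.
        if not rollout.correct or not rollout.labels:
            continue
        hits = sum(label.lower() in text for label in rollout.labels)
        scores.append(hits / len(rollout.labels))
    # With no usable grounded rollout there is no alignment signal.
    return sum(scores) / len(scores) if scores else 0.0
```

In a setup like this, the score would be added (with some weight) to the usual answer-correctness reward for the textual stream, so the textual reasoning is encouraged to attend to the same objects the grounded stream localized without emitting boxes itself.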