iVGR: Internalizing Visually Grounded Reasoning in Multimodal Large Language Models via Reinforcement Learning

Abstract

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.