ChatPaper.ai

VGR: Visual Grounded Reasoning

June 13, 2025
Authors: Jiacong Wang, Zijiang Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao
cs.AI

Abstract

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand a comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in language space, VGR first detects relevant regions that may help solve the problem, and then provides precise answers based on the replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data mixing visual grounding with language deduction. The inference pipeline of VGR allows the model to select bounding boxes for visual reference, and a replay stage is introduced to integrate the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image-detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while delivering gains of +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
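The detect-then-replay pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `detect_regions` is a hypothetical stand-in for the model step that emits bounding boxes, and the patch-based token count is an illustrative assumption about how replayed crops consume visual tokens.

```python
# Hypothetical sketch of VGR-style grounded reasoning: detect relevant
# regions, replay their crops into the context, and count visual tokens.
# All names and the tokenization scheme here are illustrative assumptions.

def detect_regions(question, image):
    """Stand-in for the model step that emits bounding boxes
    (x1, y1, x2, y2) for regions relevant to the question."""
    # A real MLLM would generate these during reasoning; we return a fixed box.
    return [(10, 10, 60, 40)]

def crop(image, box):
    """Crop a region from an image stored as a list of pixel rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def replay_reasoning(question, image, patch=14):
    """Detect relevant regions, 'replay' their crops for the model to
    re-attend to, and return the crops plus the visual-token cost
    (assumed here to be one token per patch-by-patch tile of each crop)."""
    regions = detect_regions(question, image)
    crops = [crop(image, box) for box in regions]
    tokens = sum(
        (len(c) // patch + 1) * (len(c[0]) // patch + 1)
        for c in crops
        if c and c[0]
    )
    return crops, tokens

# Toy 100x100 "image" as rows of pixel values.
image = [[0] * 100 for _ in range(100)]
crops, tokens = replay_reasoning("What is in the top-left?", image)
```

Because only the selected crops are re-encoded, rather than the full image at high resolution, a pipeline of this shape can spend far fewer image tokens than full-image tiling, which is consistent with the 30% token budget reported in the abstract.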