VGR: Visual Grounded Reasoning

Abstract

In multimodal chain-of-thought (CoT) reasoning, existing approaches reason predominantly in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand a comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in language space, VGR first detects relevant regions that may help solve the problem, and then provides precise answers based on the replayed image regions. To achieve this, we construct a large-scale supervised fine-tuning (SFT) dataset called VGR-SFT that contains reasoning data mixing vision grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference, and a replay stage integrates the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks requiring comprehensive image detail understanding. Compared to the baseline, VGR uses only 30% of the image token count while improving scores by 4.1 on MMStar, 7.1 on AI2D, and 12.9 on ChartQA.
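
As a rough illustration of the detect-then-replay idea described in the abstract, the sketch below parses bounding boxes out of a piece of generated reasoning text and crops the matching regions from an image. It is not the paper's implementation: the `<box>x1,y1,x2,y2</box>` tag format, the function names `extract_boxes`, `crop`, and `replay_regions`, and the nested-list image are all assumptions made for this example, and the real model re-encodes the selected regions as image tokens rather than returning raw pixels.

```python
import re

# Hypothetical tag format: the abstract does not say how boxes are emitted.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def extract_boxes(text):
    """Return an (x1, y1, x2, y2) tuple for every box tag found in the text."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(text)]


def crop(image, box):
    """Crop a row-major image (list of rows) to a box, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def replay_regions(image, reasoning_text):
    """Pair each box referenced in the reasoning text with the pixels it covers."""
    return [(box, crop(image, box)) for box in extract_boxes(reasoning_text)]


# Toy 4x4 "image" whose pixel values encode their own (row, column) position.
image = [[10 * r + c for c in range(4)] for r in range(4)]
step = "The legend is in the top-right corner <box>2,0,4,2</box>, so I will look there."
regions = replay_regions(image, step)
print(regions)
```

In a full pipeline, each cropped region would be passed back through the vision encoder and appended to the model's context before generation continues. That step is omitted here because the abstract gives no details about it.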