VGR: Visual Grounded Reasoning

Abstract

In the field of multimodal chain-of-thought (CoT) reasoning, existing approaches predominantly reason in pure language space, which inherently suffers from language bias and is largely confined to math or science domains. This narrow focus limits their ability to handle complex visual reasoning tasks that demand comprehensive understanding of image details. To address these limitations, this paper introduces VGR, a novel reasoning multimodal large language model (MLLM) with enhanced fine-grained visual perception capabilities. Unlike traditional MLLMs that answer questions or reason solely in language space, VGR first detects relevant regions that may help to solve the problem, and then provides precise answers based on the replayed image regions. To achieve this, we construct a large-scale SFT dataset called VGR-SFT that contains reasoning data mixing visual grounding and language deduction. The inference pipeline of VGR allows the model to choose bounding boxes for visual reference, and a replay stage integrates the corresponding regions into the reasoning process, enhancing multimodal comprehension. Experiments on the LLaVA-NeXT-7B baseline show that VGR achieves superior performance on multimodal benchmarks that require comprehensive understanding of image details. Compared to the baseline, VGR uses only 30% of the image token count while improving scores by +4.1 on MMStar, +7.1 on AI2D, and +12.9 on ChartQA.
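
The select-then-replay idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `<box>x1,y1,x2,y2</box>` markup, the function names, and the representation of an image as a list of pixel rows are all assumptions made for illustration, since the abstract does not specify how boxes are encoded or how replayed regions are fed back to the model.

```python
import re
from typing import List, Tuple

# (x1, y1, x2, y2) in pixel coordinates; x2 and y2 are exclusive.
Box = Tuple[int, int, int, int]

# Hypothetical box markup; the abstract does not state the real format.
BOX_PATTERN = re.compile(
    r"<box>\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*</box>"
)


def extract_boxes(reasoning_text: str) -> List[Box]:
    """Collect every box the model emitted in its reasoning text."""
    return [
        tuple(int(v) for v in match.groups())
        for match in BOX_PATTERN.finditer(reasoning_text)
    ]


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a row-major image to the box, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]


def replay_regions(image: List[List[int]], reasoning_text: str):
    """Pair each selected box with its cropped region.

    In a real pipeline the crops would be re-encoded into visual tokens
    and appended to the model context; that step is omitted here.
    """
    return [(box, crop(image, box)) for box in extract_boxes(reasoning_text)]
```

For example, calling `replay_regions(image, "The legend is at <box>1,0,3,2</box>.")` returns the selected box together with the pixels it covers, which is the piece of the image a replay stage would hand back to the model.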
