VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

Abstract

Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model's general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.
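The reframing described in the abstract, from generating coordinates to retrieving region features, can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`pool_region`, `region_token`, `build_prompt`, `resolve_regions`), the `<regionN>` placeholder format, the mean pooling, and the simple concatenation of two feature maps are all assumptions made for illustration, and the candidate boxes are assumed to come from some external proposal source. The point is only that the language model refers to a region by token, and the box is recovered by lookup rather than by emitting numbers.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]
FeatureMap = List[List[List[float]]]     # indexed as [row][col][channel]


def pool_region(fmap: FeatureMap, box: Box) -> List[float]:
    """Average the feature vectors of all cells whose centers fall inside the box."""
    rows, cols = len(fmap), len(fmap[0])
    x0, y0, x1, y1 = box
    picked = [
        fmap[r][c]
        for r in range(rows)
        for c in range(cols)
        if x0 <= (c + 0.5) / cols <= x1 and y0 <= (r + 0.5) / rows <= y1
    ]
    if not picked:
        raise ValueError(f"box {box} covers no feature cell")
    channels = len(picked[0])
    return [sum(v[ch] for v in picked) / len(picked) for ch in range(channels)]


def region_token(semantic: FeatureMap, detail: FeatureMap, box: Box) -> List[float]:
    """Hypothetical stand-in for a dual-encoder region feature:
    pool the same box from two feature maps and concatenate the results."""
    return pool_region(semantic, box) + pool_region(detail, box)


def build_prompt(question: str, boxes: List[Box]) -> str:
    """Emit one placeholder per candidate region, followed by the question.
    In a real model each placeholder would be replaced by its region token."""
    refs = " ".join(f"<region{i}>" for i in range(len(boxes)))
    return f"Regions: {refs}\n{question}"


def resolve_regions(answer: str, boxes: List[Box]) -> List[Box]:
    """Map region placeholders in the model's answer back to boxes by lookup.
    Placeholders with an out-of-range index are ignored."""
    indices = [int(i) for i in re.findall(r"<region(\d+)>", answer)]
    return [boxes[i] for i in indices if i < len(boxes)]
```

In this sketch, grounding an answer such as `"The dog is <region1>."` reduces to `resolve_regions(answer, boxes)`, so localization accuracy depends on the quality of the candidate boxes and region features rather than on the model's ability to write exact digits.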
