
VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

September 30, 2025
Authors: Peng Liu, Haozhan Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, Tiancheng Zhao
cs.AI

Abstract

Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generative understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model's general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.
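
The sketch below is a minimal illustration (not the paper's released implementation) of the general idea the abstract describes: pooling features for each candidate region from two vision feature maps, fusing them into a single "region token" in the LLM's embedding space, and letting the model refer to objects via placeholder tokens such as <region_1> instead of emitting numeric coordinates. All module names, dimensions, and the simple mean-pool fusion here are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class RegionTokenizer(nn.Module):
    """Illustrative sketch (hypothetical, not the paper's HFRE):
    pool features for each box from two vision feature maps
    (one semantic, one spatial), fuse them, and project into the
    LLM embedding space as one region token per box."""

    def __init__(self, sem_dim=1024, spa_dim=768, llm_dim=4096):
        super().__init__()
        self.fuse = nn.Linear(sem_dim + spa_dim, llm_dim)

    @staticmethod
    def pool(feat_map, box):
        # feat_map: (C, H, W); box: normalized (x1, y1, x2, y2)
        _, H, W = feat_map.shape
        x1, y1, x2, y2 = box
        xs, xe = int(x1 * W), max(int(x2 * W), int(x1 * W) + 1)
        ys, ye = int(y1 * H), max(int(y2 * H), int(y1 * H) + 1)
        # mean-pool the crop of the feature map covered by the box
        return feat_map[:, ys:ye, xs:xe].mean(dim=(1, 2))

    def forward(self, sem_map, spa_map, boxes):
        tokens = []
        for box in boxes:
            sem = self.pool(sem_map, box)   # semantic detail
            spa = self.pool(spa_map, box)   # spatial detail
            tokens.append(self.fuse(torch.cat([sem, spa])))
        return torch.stack(tokens)          # (num_regions, llm_dim)


# Example usage: region tokens would be spliced into the prompt at
# placeholder markers (e.g. "<region_1>"), so the LLM reasons about
# and refers to objects by token rather than by coordinate strings.
sem_map = torch.randn(1024, 24, 24)   # e.g. from a CLIP-style encoder (assumed)
spa_map = torch.randn(768, 96, 96)    # e.g. from a high-res backbone (assumed)
boxes = [(0.1, 0.2, 0.4, 0.6), (0.5, 0.5, 0.9, 0.9)]

tokenizer = RegionTokenizer()
region_tokens = tokenizer(sem_map, spa_map, boxes)
print(region_tokens.shape)            # torch.Size([2, 4096])
```

The point of the sketch is the reframing the abstract emphasizes: the model retrieves and references region features rather than generating fragile numeric coordinates as text.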