VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs
September 30, 2025
Authors: Peng Liu, Haozhan Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, Tiancheng Zhao
cs.AI
Abstract
Vision-Language Models (VLMs) excel at high-level scene understanding but
falter on fine-grained perception tasks requiring precise localization. This
failure stems from a fundamental mismatch, as generating exact numerical
coordinates is a challenging task for language-centric architectures. In this
paper, we introduce VLM-FO1, a novel framework that overcomes this limitation
by reframing object-centric perception from a brittle coordinate generation
problem into a robust feature retrieval task. Our method operates as a
plug-and-play module that integrates with any pre-trained VLM. It leverages a
Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to
generate powerful region tokens rich in both semantic and spatial detail. A
token-based referencing system then enables the LLM to seamlessly reason about
and ground language in these specific visual regions. Experiments show that
VLM-FO1 achieves state-of-the-art performance across a diverse suite of
benchmarks, demonstrating exceptional capabilities in object grounding, region
generative understanding, and visual region reasoning. Crucially, our
two-stage training strategy ensures that these perception gains are achieved
without compromising the base model's general visual understanding
capabilities. VLM-FO1 establishes an effective and flexible paradigm for
building perception-aware VLMs, bridging the gap between high-level reasoning
and fine-grained visual grounding.
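
To make the core idea concrete, the sketch below illustrates, in PyTorch-like pseudocode, how a hybrid region encoder could pool per-region features from two vision encoders into "region tokens" and how a token-based referencing scheme could splice those tokens into the LLM's prompt, so that regions are referred to by token rather than by generated coordinates. This is a minimal illustration of the concept described in the abstract, not the authors' implementation; all names (`HybridRegionEncoder`, `splice_region_tokens`, `llm_dim`, etc.) and design choices (RoIAlign pooling, concatenation, MLP projection) are assumptions.

```python
# Minimal sketch (assumed, not the paper's code): per-region features from two
# vision encoders are pooled, fused, and projected into LLM-space region tokens;
# the LLM then references regions via placeholder tokens instead of coordinates.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class HybridRegionEncoder(nn.Module):
    """Pools region features from a semantic feature map and a high-resolution
    detail feature map, then projects them into the LLM embedding space."""

    def __init__(self, sem_dim: int, det_dim: int, llm_dim: int, pool: int = 7):
        super().__init__()
        self.pool = pool
        self.proj = nn.Sequential(
            nn.Linear((sem_dim + det_dim) * pool * pool, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, sem_feats, det_feats, boxes, image_size):
        # sem_feats: [B, C1, H1, W1], det_feats: [B, C2, H2, W2]
        # boxes: list (one per image) of [N_i, 4] tensors in pixel (x1, y1, x2, y2).
        img_h, img_w = image_size
        sem_scale = sem_feats.shape[-1] / img_w
        det_scale = det_feats.shape[-1] / img_w
        sem_roi = roi_align(sem_feats, boxes, self.pool, spatial_scale=sem_scale, aligned=True)
        det_roi = roi_align(det_feats, boxes, self.pool, spatial_scale=det_scale, aligned=True)
        fused = torch.cat([sem_roi, det_roi], dim=1)   # [sum N_i, C1+C2, pool, pool]
        return self.proj(fused.flatten(1))             # [sum N_i, llm_dim] region tokens


def splice_region_tokens(text_embeds, region_tokens, placeholder_positions):
    """Token-based referencing: overwrite placeholder positions (e.g. where
    '<region1>' sits in the prompt) with the corresponding region tokens."""
    out = text_embeds.clone()                          # [T, llm_dim] prompt embeddings
    for tok, pos in zip(region_tokens, placeholder_positions):
        out[pos] = tok
    return out


if __name__ == "__main__":
    enc = HybridRegionEncoder(sem_dim=1024, det_dim=768, llm_dim=4096)
    sem = torch.randn(1, 1024, 24, 24)                 # semantic encoder features
    det = torch.randn(1, 768, 48, 48)                  # detail encoder features
    boxes = [torch.tensor([[50.0, 40.0, 200.0, 180.0]])]
    tokens = enc(sem, det, boxes, image_size=(336, 336))
    print(tokens.shape)                                # torch.Size([1, 4096])
```

With this scheme, a question such as "What is <region1> doing?" is answered by attending to the spliced region token, which is the sense in which the paper reframes localization as feature retrieval rather than coordinate generation.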