VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

Abstract

Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model's general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.
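
The shift from coordinate generation to retrieval can be illustrated with a minimal sketch. This is not the authors' implementation: the `<regionN>` token format, the prompt layout, and the helper names `build_prompt` and `resolve_regions` are all assumptions made for illustration. The idea is that the model refers to candidate regions by token, and the boxes are looked up afterwards instead of being written out as numbers.

```python
import re
from typing import List, Tuple

# (x1, y1, x2, y2) in pixels; candidate boxes are assumed to come from
# some external proposal source, which is outside this sketch.
Box = Tuple[float, float, float, float]


def build_prompt(question: str, boxes: List[Box]) -> str:
    """Prefix the question with one hypothetical placeholder token per region."""
    region_tokens = " ".join(f"<region{i}>" for i in range(len(boxes)))
    return f"Regions: {region_tokens}\nQuestion: {question}"


def resolve_regions(answer: str, boxes: List[Box]) -> List[Box]:
    """Map region tokens found in the model's answer back to their boxes.

    Tokens that point past the end of the candidate list are ignored.
    """
    indices = [int(m) for m in re.findall(r"<region(\d+)>", answer)]
    return [boxes[i] for i in indices if i < len(boxes)]
```

In this sketch the language model never emits a coordinate: a reply such as "The dog is in <region1>." is resolved to the second candidate box by lookup, so localization accuracy depends on the candidate regions and their features, not on digit-by-digit number generation.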
