Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Abstract

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, because the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled by explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so that the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: it improves referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
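The abstract does not describe how depth tokenization or soft merging is implemented. As a rough illustration only, the sketch below shows generic vector quantization (nearest-codebook lookup, as used in VQ-VAE-style tokenizers) and one common way to make a codebook lookup differentiable: replacing the hard index with a softmax-weighted average of codebook entries. The function names, the toy codebook, and the use of softmax weighting are assumptions made for this example and should not be read as Perceptio's actual method.

```python
import math

def nearest_code(vector, codebook):
    """Hard quantization: index of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_merge(logits, codebook):
    """Probability-weighted average of codebook entries.

    Assumption for illustration: unlike a hard argmax, this output varies
    smoothly with the logits, which is what makes a reconstruction loss
    differentiable with respect to the token predictions.
    """
    probs = softmax(logits)
    dim = len(codebook[0])
    return [sum(p * entry[d] for p, entry in zip(probs, codebook))
            for d in range(dim)]

# Hypothetical 3-entry codebook of 2-D depth embeddings.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

token = nearest_code([0.9, 0.1], codebook)        # closest entry is index 1
merged = soft_merge([0.0, 0.0, 0.0], codebook)    # uniform weights over entries
```

With uniform logits the merged vector is the plain average of the codebook entries; as one logit dominates, the merged vector approaches the corresponding hard codebook entry.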
