Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Abstract

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so that the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: it improves referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
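The abstract names two mechanisms without spelling them out: tokenizing depth with a VQ-VAE codebook and "soft-merging" for differentiable reconstruction. The sketch below is only an illustration under stated assumptions, not the authors' implementation. It assumes that tokenization means nearest-neighbor lookup in a codebook, and that soft-merging means a softmax-weighted average of codebook vectors in place of a hard argmax. The function names, the toy codebook, and the temperature parameter are all hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Hard VQ step: index of the nearest codebook entry for each feature.

    features: (N, D) array, codebook: (K, D) array. Returns (N,) int indices.
    """
    # Squared Euclidean distance between every feature and every code.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def soft_merge(logits, codebook, temperature=1.0):
    """Assumed reading of soft-merging: a probability-weighted mix of codes.

    logits: (N, K) scores over the K depth tokens. Returns (N, D) embeddings.
    Unlike argmax, this mapping is differentiable with respect to the logits.
    """
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ codebook

# Toy codebook with K=8 entries of dimension D=4 (hypothetical values).
codebook = np.arange(32, dtype=float).reshape(8, 4)

# Features that sit close to codes 3, 1 and 5.
features = codebook[[3, 1, 5]] + 0.1
tokens = quantize(features, codebook)

# Very confident logits make the soft merge approach the hard lookup.
sharp_logits = np.zeros((3, 8))
sharp_logits[np.arange(3), tokens] = 50.0
merged = soft_merge(sharp_logits, codebook)
```

With sharply peaked logits the soft merge approaches the hard codebook lookup, while flatter logits yield a blend of neighboring codes. Under this reading, that blending is what lets a reconstruction loss pass gradients back to the token predictions.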
