
Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

March 19, 2026
作者: Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler
cs.AI

Abstract

Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception-enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular depth estimation teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM, so that the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: referring expression segmentation improves by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
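The abstract's depth tokenization can be illustrated with a minimal sketch. This is not the paper's implementation: the codebook size, embedding dimension, temperature, and all function names are invented for illustration. It shows the two mechanisms the abstract names: a hard VQ lookup that turns each depth-patch embedding into a discrete token id (the compact sequence fed to the LLM), and a "soft merging" step that blends codebook entries with softmax weights so reconstruction stays differentiable during training.

```python
import numpy as np

# Hypothetical sketch of VQ-VAE depth tokenization with soft merging.
# Codebook size K, embedding dim D, and temperature tau are guesses.
rng = np.random.default_rng(0)
K, D = 8, 4
codebook = rng.normal(size=(K, D))          # distilled depth codebook

def depth_tokens(z):
    """Hard assignment: nearest codebook entry per patch embedding."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    return d2.argmin(-1)                                        # (N,)

def soft_merge(z, tau=1.0):
    """Differentiable reconstruction: softmax-weighted codebook mix."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    logits = -d2 / tau
    logits = logits - logits.max(-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(-1, keepdims=True)                 # (N, K) weights
    return w @ codebook                              # (N, D) reconstruction

z = rng.normal(size=(3, D))       # 3 patch embeddings from a depth map
ids = depth_tokens(z)             # discrete depth token ids for the LLM
recon = soft_merge(z, tau=1e-3)   # low tau -> approaches the hard lookup
```

As the temperature shrinks, the soft reconstruction converges to the hard codebook lookup, which is the usual motivation for softmax relaxations of vector quantization.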
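The composite depth-token objective (marker, token, and count losses) is not specified in this abstract; one plausible reading, with every weight and name invented for illustration, is a weighted sum of cross-entropy terms on the special marker tokens and the depth-code ids, plus a penalty on emitting the wrong number of depth tokens:

```python
import numpy as np

# Hypothetical decomposition of a composite depth-token objective:
# marker loss on begin/end-of-depth markers, token loss on depth code
# ids, and a count loss on sequence length. Weights are illustrative.

def cross_entropy(logits, targets):
    """Mean negative log-likelihood of integer targets under logits."""
    logits = logits - logits.max(-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def depth_token_loss(marker_logits, marker_tgt,
                     token_logits, token_tgt,
                     n_emitted, n_expected,
                     w_marker=1.0, w_token=1.0, w_count=0.1):
    l_marker = cross_entropy(marker_logits, marker_tgt)   # marker loss
    l_token = cross_entropy(token_logits, token_tgt)      # token loss
    l_count = abs(n_emitted - n_expected)                 # count loss (L1)
    return w_marker * l_marker + w_token * l_token + w_count * l_count

rng = np.random.default_rng(1)
ml, mt = rng.normal(size=(4, 3)), np.array([0, 1, 2, 0])
tl, tt = rng.normal(size=(16, 8)), rng.integers(0, 8, size=16)
loss = depth_token_loss(ml, mt, tl, tt, n_emitted=16, n_expected=16)
```

A count penalty of this kind would directly discourage the truncated or runaway depth sequences that make free-form dense-token generation unstable.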