Unlocking Dense Metric Depth Estimation in VLMs
May 15, 2026
Authors: Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke
cs.AI
Abstract
Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, which introduces error accumulation, or predict depth directly through inefficient per-pixel queries or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs while offering higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
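To make the idea of a depth head on top of the LLM backbone concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not describe the head's layers, so everything here is an assumption made for illustration: the function name `depth_head`, the single linear projection, the softplus activation that keeps depths positive, and the nearest-neighbour upsampling from the visual token grid to the output resolution.

```python
import math


def depth_head(token_states, weights, bias, grid_hw, out_hw):
    """Map per-token hidden states to a dense, strictly positive depth map.

    Hypothetical sketch; the real DepthVLM head is not specified in the abstract.
    token_states: one hidden vector per visual token, in row-major grid order.
    weights, bias: an assumed linear projection to one scalar per token.
    grid_hw: (rows, cols) of the visual token grid.
    out_hw: (height, width) of the output depth map.
    """
    rows, cols = grid_hw
    height, width = out_hw
    if len(token_states) != rows * cols:
        raise ValueError("token count does not match the token grid")

    # Linear projection followed by softplus, so every depth is positive.
    coarse = []
    for state in token_states:
        logit = sum(w * x for w, x in zip(weights, state)) + bias
        coarse.append(math.log1p(math.exp(logit)))

    # Nearest-neighbour upsampling from the token grid to the output grid.
    depth = []
    for y in range(height):
        src_y = min(y * rows // height, rows - 1)
        row = []
        for x in range(width):
            src_x = min(x * cols // width, cols - 1)
            row.append(coarse[src_y * cols + src_x])
        depth.append(row)
    return depth


# Toy usage: a 2x2 token grid with hidden size 3, upsampled to a 4x4 map.
tokens = [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [-1.0, 0.5, 0.0]]
depth = depth_head(tokens, weights=[0.5, -0.2, 0.1], bias=0.0,
                   grid_hw=(2, 2), out_hw=(4, 4))
```

The point of the sketch is the data flow: hidden states that the backbone already computes for the visual tokens are decoded into a dense map in one pass, rather than querying the model once per pixel. A practical head would likely use learned convolutional or interpolation-based upsampling in place of the nearest-neighbour step used here.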