DepthLM: Metric Depth From Vision Language Models
September 29, 2025
Authors: Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, Yangyang Shi
cs.AI
Abstract
Vision language models (VLMs) can flexibly address various vision tasks
through text interactions. Although successful in semantic understanding,
state-of-the-art VLMs, including GPT-5, still struggle to understand 3D from
2D inputs. On the other hand, expert pure vision models achieve super-human
accuracy in metric depth estimation, a key 3D understanding task. However, they
require task-specific architectures and losses. This difference motivates us to
ask: can VLMs reach expert-level accuracy without changing the architecture or loss?
We take per-pixel metric depth estimation as the representative task and show
that the answer is yes! Surprisingly, a comprehensive analysis shows that
text-based supervised fine-tuning with sparse labels is sufficient for VLMs to
unlock strong 3D understanding; no dense prediction head or complex
regression/regularization loss is needed. The bottleneck for VLMs actually lies
in pixel reference and cross-dataset camera ambiguity, which we address through
visual prompting and intrinsic-conditioned augmentation. With much smaller
models, our method DepthLM surpasses the accuracy of the most advanced VLMs by
over 2x, making VLMs comparable to pure vision models for the first time.
Interestingly, without explicit enforcement during training, VLMs trained with
DepthLM naturally avoid over-smoothing, producing far fewer flying points at
boundary regions than pure vision models. The simplicity of DepthLM also
enables a single VLM to cover various 3D tasks beyond metric depth. Our code
and model will be released at the link below.
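
For intuition, the sketch below illustrates the two ingredients the abstract names: a visual prompt that marks the query pixel so the model can reference it in text, and an intrinsic-conditioned augmentation that keeps the stated focal length consistent when images are rescaled. This is a minimal illustration under our own assumptions, not the authors' released code; the function names (mark_pixel, intrinsic_conditioned_resize, build_sample) and prompt wording are hypothetical.

```python
# Minimal sketch (hypothetical, not the DepthLM release) of visual prompting
# and intrinsic-conditioned augmentation for text-based depth queries.
from PIL import Image, ImageDraw

def mark_pixel(img: Image.Image, x: int, y: int, r: int = 6) -> Image.Image:
    """Visual prompt: draw a circle so the VLM can reference a single pixel."""
    out = img.copy()
    ImageDraw.Draw(out).ellipse((x - r, y - r, x + r, y + r),
                                outline="red", width=3)
    return out

def intrinsic_conditioned_resize(img: Image.Image, fx: float, target_w: int):
    """Resizing changes the effective focal length; scaling fx with the image
    and exposing it in the prompt addresses cross-dataset camera ambiguity."""
    scale = target_w / img.width
    resized = img.resize((target_w, round(img.height * scale)))
    return resized, fx * scale

def build_sample(img, x, y, fx, depth_m, target_w=448):
    """One SFT sample: an image with a marked pixel, a text question, and the
    sparse metric-depth label rendered as plain text (no regression head)."""
    resized, fx_eff = intrinsic_conditioned_resize(img, fx, target_w)
    s = target_w / img.width  # uniform scale, applies to both coordinates
    prompted = mark_pixel(resized, round(x * s), round(y * s))
    question = (f"The camera focal length is {fx_eff:.1f} pixels. "
                "What is the metric depth, in meters, at the red circle?")
    answer = f"{depth_m:.2f}"  # supervised via the standard next-token loss
    return prompted, question, answer
```

Because the label is just text, the usual cross-entropy SFT objective suffices, which is consistent with the abstract's claim that no dense prediction head or task-specific loss is needed.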