DepthLM: Metric Depth From Vision Language Models
September 29, 2025
Authors: Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, Yangyang Shi
cs.AI
Abstract
Vision language models (VLMs) can flexibly address various vision tasks
through text interactions. Although successful in semantic understanding,
state-of-the-art VLMs, including GPT-5, still struggle to understand 3D from
2D inputs. On the other hand, expert pure vision models achieve super-human
accuracy in metric depth estimation, a key 3D understanding task. However, they
require task-specific architectures and losses. This difference motivates us to
ask: can VLMs reach expert-level accuracy without architectural or loss changes?
We take per-pixel metric depth estimation as the representative task and show
that the answer is yes! Surprisingly, comprehensive analysis shows that
text-based supervised fine-tuning with sparse labels is sufficient for VLMs to
unlock strong 3D understanding; no dense prediction head or complex
regression/regularization loss is needed. The bottleneck for VLMs actually lies
in pixel reference and cross-dataset camera ambiguity, which we address through
visual prompting and intrinsic-conditioned augmentation. With much smaller
models, our method DepthLM surpasses the accuracy of the most advanced VLMs by
over 2x, making VLMs comparable with pure vision models for the first time.
Interestingly, without explicit enforcement during training, VLMs trained with
DepthLM naturally avoid over-smoothing, producing far fewer flying points at
boundary regions than pure vision models. The simplicity of DepthLM also
enables a single VLM to cover various 3D tasks beyond metric depth. Our code
and model will be released at the link below.
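
To make the two fixes named in the abstract concrete, here is a minimal Python
sketch of how a per-pixel depth query could be posed: a visual prompt marks the
query pixel so the model can refer to it, and the text prompt is conditioned on
camera intrinsics, which the resize augmentation changes while ground-truth
depth stays fixed. This is not the authors' released code; the function names,
marker style, and prompt wording are illustrative assumptions.

```python
# Sketch, assuming a Pillow-based preprocessing pipeline.
from PIL import Image, ImageDraw

def mark_query_pixel(image: Image.Image, x: int, y: int) -> Image.Image:
    """Visual prompting: draw a red circle so the VLM can refer to one pixel."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    r = max(3, min(marked.size) // 100)  # marker radius scales with image size
    draw.ellipse([x - r, y - r, x + r, y + r], outline="red", width=3)
    return marked

def intrinsic_conditioned_resize(image: Image.Image, fx: float, fy: float,
                                 scale: float):
    """Augmentation: resizing by `scale` multiplies the focal length by `scale`
    while true metric depth is unchanged, so the prompt must carry the new
    intrinsics for the depth question to be well-posed."""
    w, h = image.size
    resized = image.resize((int(w * scale), int(h * scale)))
    return resized, fx * scale, fy * scale

def build_prompt(fx: float, fy: float) -> str:
    """Intrinsic-conditioned text prompt (wording is hypothetical)."""
    return (f"The camera focal length is fx={fx:.1f}, fy={fy:.1f} pixels. "
            "What is the metric depth, in meters, of the point marked by "
            "the red circle? Answer with a single number.")

# Example: one sparse training sample = (marked image, prompt, "3.2" meters).
img = Image.new("RGB", (640, 480), "gray")  # placeholder image
aug, fx, fy = intrinsic_conditioned_resize(img, 600.0, 600.0, scale=0.75)
query_image = mark_query_pixel(aug, x=240, y=180)  # pixel coords also scaled
print(build_prompt(fx, fy))
```

The resize example illustrates why intrinsics belong in the prompt: the same
scene rendered at two scales yields identical ground-truth depths but different
focal lengths, so a VLM that never sees intrinsics cannot resolve the
cross-dataset camera ambiguity the abstract describes.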