DepthLM: Metric Depth From Vision Language Models
September 29, 2025
Authors: Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, Yangyang Shi
cs.AI
Abstract
Vision language models (VLMs) can flexibly address various vision tasks
through text interactions. Although successful in semantic understanding,
state-of-the-art VLMs, including GPT-5, still struggle to understand 3D from
2D inputs. On the other hand, expert pure vision models achieve super-human
accuracy in metric depth estimation, a key 3D understanding task. However, they
require task-specific architectures and losses. This difference motivates us to
ask: can VLMs reach expert-level accuracy without architectural or loss changes?
We take per-pixel metric depth estimation as the representative task and show
that the answer is yes! Surprisingly, comprehensive analysis shows that
text-based supervised fine-tuning with sparse labels is sufficient for VLMs to
unlock strong 3D understanding; no dense prediction head or complex
regression/regularization loss is needed. The bottleneck for VLMs actually lies
in pixel reference and cross-dataset camera ambiguity, which we address through
visual prompting and intrinsic-conditioned augmentation. With much smaller
models, our method DepthLM surpasses the accuracy of the most advanced VLMs by
over 2x, making VLMs comparable with pure vision models for the first time.
Interestingly, without explicit enforcement during training, VLMs trained with
DepthLM naturally avoid over-smoothing, producing far fewer flying points at
boundary regions than pure vision models. The simplicity of DepthLM also
enables a single VLM to cover various 3D tasks beyond metric depth. Our code
and model will be released at the link below.
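
To make the two fixes named in the abstract concrete, here is a minimal Python
sketch of how a per-pixel depth query could be posed: a visual prompt marks the
query pixel so the model can refer to it, and the text prompt is conditioned on
camera intrinsics, which the resize augmentation changes while ground-truth
depth stays fixed. This is not the authors' released code; the function names,
marker style, and prompt wording are illustrative assumptions.

```python
# Sketch, assuming a Pillow-based preprocessing pipeline.
from PIL import Image, ImageDraw

def mark_query_pixel(image: Image.Image, x: int, y: int) -> Image.Image:
    """Visual prompting: draw a red circle so the VLM can refer to one pixel."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    r = max(3, min(marked.size) // 100)  # marker radius scales with image size
    draw.ellipse([x - r, y - r, x + r, y + r], outline="red", width=3)
    return marked

def intrinsic_conditioned_resize(image: Image.Image, fx: float, fy: float,
                                 scale: float):
    """Augmentation: resizing by `scale` multiplies the focal length by `scale`
    while true metric depth is unchanged, so the prompt must carry the new
    intrinsics for the depth question to be well-posed."""
    w, h = image.size
    resized = image.resize((int(w * scale), int(h * scale)))
    return resized, fx * scale, fy * scale

def build_prompt(fx: float, fy: float) -> str:
    """Intrinsic-conditioned text prompt (wording is hypothetical)."""
    return (f"The camera focal length is fx={fx:.1f}, fy={fy:.1f} pixels. "
            "What is the metric depth, in meters, of the point marked by "
            "the red circle? Answer with a single number.")

# Example: one sparse training sample = (marked image, prompt, "3.2" meters).
img = Image.new("RGB", (640, 480), "gray")  # placeholder image
aug, fx, fy = intrinsic_conditioned_resize(img, 600.0, 600.0, scale=0.75)
query_image = mark_query_pixel(aug, x=240, y=180)  # pixel coords also scaled
print(build_prompt(fx, fy))
```

The resize example illustrates why intrinsics belong in the prompt: the same
scene rendered at two scales yields identical ground-truth depths but different
focal lengths, so a VLM that never sees intrinsics cannot resolve the
cross-dataset camera ambiguity the abstract describes.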