Unlocking Dense Metric Depth Estimation in Vision-Language Models

Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, which introduces error accumulation, or predict depth directly through inefficient per-pixel queries or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training it with a two-stage schedule under a unified vision-text supervision paradigm, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs while offering higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
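To make the idea of a lightweight depth head concrete, the following is a minimal sketch under stated assumptions, not DepthVLM's actual design, which the abstract does not specify. It assumes that each visual token's hidden state is mapped by a single linear layer to a small patch of depth values, that a softplus keeps the predicted depths positive, and that the patches are tiled into a full-resolution map. The names `depth_head`, `softplus`, `GRID_H`, `GRID_W`, `PATCH`, and `HIDDEN` are hypothetical, and random vectors stand in for the LLM backbone's hidden states.

```python
import math
import random


def softplus(x):
    # Numerically stable softplus; keeps each predicted depth positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def depth_head(tokens, weight, bias, grid_h, grid_w, patch):
    """Map grid_h * grid_w hidden vectors to a dense depth map.

    Hypothetical design: each token predicts a patch x patch block of
    depths through one linear layer, and the blocks are tiled row-major.
    """
    height, width = grid_h * patch, grid_w * patch
    depth = [[0.0] * width for _ in range(height)]
    for idx, hidden in enumerate(tokens):
        row, col = divmod(idx, grid_w)  # token position in the visual grid
        for k in range(patch * patch):
            logit = bias[k] + sum(w * h for w, h in zip(weight[k], hidden))
            dy, dx = divmod(k, patch)  # pixel offset inside the patch
            depth[row * patch + dy][col * patch + dx] = softplus(logit)
    return depth


random.seed(0)
GRID_H, GRID_W, PATCH, HIDDEN = 2, 3, 4, 8  # illustrative sizes only

# Stand-ins for backbone hidden states at the visual-token positions.
tokens = [[random.gauss(0, 1) for _ in range(HIDDEN)]
          for _ in range(GRID_H * GRID_W)]
# Untrained, randomly initialised head parameters.
weight = [[random.gauss(0, 0.1) for _ in range(HIDDEN)]
          for _ in range(PATCH * PATCH)]
bias = [0.0] * (PATCH * PATCH)

depth_map = depth_head(tokens, weight, bias, GRID_H, GRID_W, PATCH)
print(len(depth_map), len(depth_map[0]))  # prints "8 12"
```

In this sketch the depth map is computed from the same hidden states a language head would consume, which is one way a single forward pass could yield both outputs. The patch-per-token layout is only one plausible choice for reaching full resolution from token-level features.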