Enabling Dense Metric Depth Estimation in VLMs

Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, which accumulates errors, or predict depth directly through inefficient per-pixel queries or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that turns a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training it under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs while offering higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
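
To make the efficiency contrast above concrete, the following minimal sketch counts forward passes and depth values per image for the three output styles the abstract mentions. It is an illustration under stated assumptions, not the DepthVLM implementation: the names `DepthOutputCost`, `per_pixel_query`, `token_level`, and `dense_head` are hypothetical, "per-pixel query" is assumed to mean one forward pass per pixel, and "token-level output" is assumed to mean one depth value per visual token with one token per square patch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthOutputCost:
    """Forward passes needed and depth values produced for one image."""
    forward_passes: int
    depth_values: int


def per_pixel_query(height: int, width: int) -> DepthOutputCost:
    # Assumption: the model is queried once per pixel, so the number of
    # forward passes grows with the image area.
    n = height * width
    return DepthOutputCost(forward_passes=n, depth_values=n)


def token_level(height: int, width: int, patch: int) -> DepthOutputCost:
    # Assumption: one depth value per visual token, one token per
    # patch x patch block, all produced in a single forward pass.
    tokens = (height // patch) * (width // patch)
    return DepthOutputCost(forward_passes=1, depth_values=tokens)


def dense_head(height: int, width: int) -> DepthOutputCost:
    # A dense head as described in the abstract: one forward pass,
    # one depth value per pixel (full resolution).
    return DepthOutputCost(forward_passes=1, depth_values=height * width)


if __name__ == "__main__":
    # Hypothetical input size and patch size, chosen only for illustration.
    h, w, p = 448, 448, 14
    for name, cost in [
        ("per-pixel query", per_pixel_query(h, w)),
        ("token-level", token_level(h, w, p)),
        ("dense head", dense_head(h, w)),
    ]:
        print(f"{name}: {cost.forward_passes} passes, {cost.depth_values} depth values")
```

For a hypothetical 448x448 input with 14-pixel patches, per-pixel querying needs 200704 forward passes, token-level output yields only 1024 depth values, and a dense head yields all 200704 values in one pass.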