DepthLM: Metric Depth From Vision Language Models

Abstract

Vision language models (VLMs) can flexibly address various vision tasks through text interactions. Although successful in semantic understanding, state-of-the-art VLMs, including GPT-5, still struggle to understand 3D from 2D inputs. On the other hand, expert pure vision models achieve super-human accuracy in metric depth estimation, a key 3D understanding task. However, they require task-specific architectures and losses. This difference motivates us to ask: can VLMs reach expert-level accuracy without changes to their architecture or loss? We take per-pixel metric depth estimation as the representative task and show that the answer is yes! Surprisingly, comprehensive analysis shows that text-based supervised fine-tuning with sparse labels is sufficient for VLMs to unlock strong 3D understanding; no dense prediction head or complex regression/regularization loss is needed. The bottleneck for VLMs actually lies in pixel reference and cross-dataset camera ambiguity, which we address through visual prompting and intrinsic-conditioned augmentation. With much smaller models, our method DepthLM surpasses the accuracy of the most advanced VLMs by over 2x, making VLMs comparable with pure vision models for the first time. Interestingly, without explicit enforcement during training, VLMs trained with DepthLM naturally avoid over-smoothing, with far fewer flying points at boundary regions than pure vision models. The simplicity of DepthLM also enables a single VLM to cover various 3D tasks beyond metric depth. Our code and model will be released at the link below.
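The abstract names cross-dataset camera ambiguity as a bottleneck but does not spell out how intrinsic-conditioned augmentation works. As a rough illustration of the ambiguity itself, the sketch below assumes a simple pinhole camera. It shows that the same apparent object size is consistent with different metric depths under different focal lengths, and that rescaling images to a shared focal length removes that mismatch. The function names and the canonical focal length of 1000 px are hypothetical choices for this example, not taken from DepthLM.

```python
# Illustrative sketch only: pinhole-camera arithmetic behind camera ambiguity.
# All names and the canonical focal length are assumptions for this example,
# not the DepthLM implementation.

def projected_size_px(object_size_m: float, depth_m: float, focal_px: float) -> float:
    """Apparent size in pixels of an object under a pinhole camera model."""
    if depth_m <= 0 or focal_px <= 0:
        raise ValueError("depth and focal length must be positive")
    return focal_px * object_size_m / depth_m


def canonical_resize_factor(focal_px: float, canonical_focal_px: float = 1000.0) -> float:
    """Image scale factor that makes the effective focal length canonical."""
    if focal_px <= 0 or canonical_focal_px <= 0:
        raise ValueError("focal lengths must be positive")
    return canonical_focal_px / focal_px


# A 1 m object looks identical (100 px) in two different situations:
size_a = projected_size_px(1.0, depth_m=5.0, focal_px=500.0)    # near, short focal length
size_b = projected_size_px(1.0, depth_m=10.0, focal_px=1000.0)  # far, long focal length

# Resizing camera A's image to the canonical focal length separates the two cases.
factor_a = canonical_resize_factor(500.0)
size_a_canonical = size_a * factor_a
```

After the rescale, the object from camera A spans 200 px while the one from camera B still spans 100 px, so apparent size once again tracks metric depth across both cameras.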
