Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model

Abstract

Methods for monocular depth estimation have made significant strides on standard benchmarks, but zero-shot metric depth estimation remains unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity caused by unknown camera intrinsics. Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion model with several advancements: a log-scale depth parameterization to enable joint modeling of indoor and outdoor scenes, conditioning on the field of view (FOV) to handle scale ambiguity, and synthetic augmentation of the FOV during training to generalize beyond the limited camera intrinsics in training datasets. Furthermore, by employing a more diverse training mixture than is common and an efficient diffusion parameterization, our method, DMD (Diffusion for Metric Depth), achieves a 25% reduction in relative error (REL) on zero-shot indoor datasets and a 33% reduction on zero-shot outdoor datasets over the current SOTA, using only a small number of denoising steps. For an overview, see https://diffusion-vision.github.io/dmd
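The abstract names a log-scale depth parameterization, FOV conditioning, and the REL metric without giving formulas. The sketch below is one plausible reading, not the authors' implementation: the depth bounds `D_MIN` and `D_MAX`, the linear-in-log mapping to [-1, 1], and all function names are assumptions introduced here for illustration. The FOV formula is the standard pinhole relation between focal length and image size.

```python
import math

# Hypothetical depth range in metres; the abstract does not state these bounds.
D_MIN = 0.5
D_MAX = 80.0


def depth_to_log_target(depth_m, d_min=D_MIN, d_max=D_MAX):
    """Map metric depth to [-1, 1], linearly in log space (assumed mapping)."""
    clipped = min(max(depth_m, d_min), d_max)
    unit = math.log(clipped / d_min) / math.log(d_max / d_min)  # in [0, 1]
    return 2.0 * unit - 1.0


def log_target_to_depth(target, d_min=D_MIN, d_max=D_MAX):
    """Invert depth_to_log_target, returning depth in metres."""
    unit = (min(max(target, -1.0), 1.0) + 1.0) / 2.0
    return d_min * (d_max / d_min) ** unit


def vertical_fov_deg(focal_px, image_height_px):
    """Vertical field of view of a pinhole camera, in degrees."""
    return math.degrees(2.0 * math.atan(image_height_px / (2.0 * focal_px)))


def relative_error(pred, gt):
    """Mean absolute relative error (REL) over paired depth values."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)
```

Under this assumed mapping, a fixed step in the target corresponds to a fixed depth ratio rather than a fixed depth difference, so near indoor depths and far outdoor depths each get a comparable share of the output range.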
