Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

Abstract

This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Rather than pursuing novel technical modules, we aim to build a simple yet powerful foundation model that handles any image under any circumstances. To this end, we scale up the dataset by designing a data engine that collects and automatically annotates large-scale unlabeled data (~62M), which significantly enlarges data coverage and thereby reduces the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, we create a more challenging optimization target by leveraging data augmentation tools, which compels the model to actively seek extra visual knowledge and acquire robust representations. Second, we develop an auxiliary supervision that enforces that the model inherits rich semantic priors from pre-trained encoders. We evaluate the model's zero-shot capabilities extensively on six public datasets and on randomly captured photos, where it demonstrates impressive generalization ability. Further, fine-tuning it with metric depth information from NYUv2 and KITTI sets new state-of-the-art (SOTA) results. Our better depth model also yields a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
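
The abstract names the two strategies without giving implementation details, so the following is only a minimal sketch under stated assumptions rather than the released implementation. It assumes a teacher model that pseudo-labels the clean unlabeled image, a student that must match that label from a perturbed copy, a mean absolute error as a stand-in depth loss, and a cosine-based term for the auxiliary supervision. All function names and the flat-list data representation are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence

Depth = List[float]          # flattened per-pixel depth map (assumed representation)
Feature = Sequence[float]    # one feature vector per spatial position (assumed)


def cosine_similarity(u: Feature, v: Feature) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def feature_alignment_loss(student_feats: Sequence[Feature],
                           frozen_feats: Sequence[Feature]) -> float:
    """Auxiliary supervision (assumed form): mean of 1 - cos(student, frozen encoder)."""
    terms = [1.0 - cosine_similarity(s, f)
             for s, f in zip(student_feats, frozen_feats)]
    return sum(terms) / len(terms)


def mean_abs_error(pred: Depth, target: Depth) -> float:
    """Stand-in depth loss; the loss actually used is not stated in the abstract."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def unlabeled_loss(image: List[float],
                   teacher: Callable[[List[float]], Depth],
                   student: Callable[[List[float]], Depth],
                   perturb: Callable[[List[float]], List[float]],
                   depth_loss: Callable[[Depth, Depth], float] = mean_abs_error) -> float:
    """Harder optimization target: the pseudo label comes from the clean image,
    while the student predicts from a perturbed (augmented) copy."""
    pseudo_label = teacher(image)
    prediction = student(perturb(image))
    return depth_loss(prediction, pseudo_label)
```

In this sketch a training step on unlabeled data would add `unlabeled_loss` and `feature_alignment_loss`, possibly with a weighting factor. How the two terms are weighted, and which perturbations are used, are left open here because the abstract does not specify them.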
