Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
January 19, 2024
Authors: Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao
cs.AI
Abstract
This work presents Depth Anything, a highly practical solution for robust
monocular depth estimation. Without pursuing novel technical modules, we aim to
build a simple yet powerful foundation model that can deal with any image under
any circumstances. To this end, we scale up the dataset by designing a data engine
to collect and automatically annotate large-scale unlabeled data (~62M), which
significantly enlarges the data coverage and thus is able to reduce the
generalization error. We investigate two simple yet effective strategies that
make data scaling-up promising. First, a more challenging optimization target
is created by leveraging data augmentation tools. It compels the model to
actively seek extra visual knowledge and acquire robust representations.
Second, auxiliary supervision is developed that forces the model to inherit
rich semantic priors from pre-trained encoders. We extensively evaluate its
zero-shot capabilities on six public datasets and randomly captured photos,
where it demonstrates impressive generalization ability. Further, by
fine-tuning it with metric depth information from NYUv2 and KITTI, we set new
state-of-the-art results. Our better depth model also results in a better depth-conditioned
ControlNet. Our models are released at
https://github.com/LiheYoung/Depth-Anything.
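The first strategy in the abstract (a more challenging optimization target built from data augmentation on unlabeled images) is only described at a high level. Below is a minimal, hypothetical PyTorch sketch of how such a self-training step could look: a frozen teacher pseudo-labels the clean unlabeled image, and the student must reproduce that depth from a strongly perturbed view. The specific perturbations, the `affine_invariant_loss` normalization, and all function names are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of the "harder optimization target" on unlabeled images:
# a frozen teacher pseudo-labels the clean view, while the student must match
# that pseudo depth from a strongly perturbed view. Names are illustrative.
import torch
from torchvision import transforms as T

strong_perturb = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.RandomGrayscale(p=0.2),
    T.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),
])

def affine_invariant_loss(pred, target, eps=1e-6):
    """Scale-and-shift-invariant depth loss (MiDaS-style per-image normalization)."""
    def normalize(d):
        d = d.flatten(1)
        t = d.median(dim=1, keepdim=True).values          # per-image shift
        s = (d - t).abs().mean(dim=1, keepdim=True) + eps  # per-image scale
        return (d - t) / s
    return (normalize(pred) - normalize(target)).abs().mean()

def unlabeled_step(student, teacher, images):
    """One training step on a batch of unlabeled images (B, 3, H, W)."""
    with torch.no_grad():
        pseudo_depth = teacher(images)      # pseudo label on the clean view
    hard_view = strong_perturb(images)      # challenging input for the student
    pred = student(hard_view)
    return affine_invariant_loss(pred, pseudo_depth)
```

Because the student never sees the clean view that produced its target, it cannot simply copy the teacher and is pushed toward the more robust representations the abstract refers to.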
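The second strategy, auxiliary supervision that transfers semantic priors from a pre-trained encoder, could plausibly be realized as a feature-alignment term. The sketch below pulls the student's dense features toward those of a frozen self-supervised encoder via per-pixel cosine similarity; the choice of encoder, the projection head `proj_head`, and the tolerance margin are assumptions not specified in the abstract.

```python
# Hypothetical feature-alignment auxiliary loss: align the student's dense
# features with those of a frozen, semantically rich pre-trained encoder.
# `proj_head` and the margin value are illustrative, not the official code.
import torch
import torch.nn.functional as F

def semantic_prior_loss(student_feats, frozen_feats, proj_head, margin=0.85):
    """student_feats, frozen_feats: (B, C, H, W) dense feature maps."""
    # Project student features into the frozen encoder's feature space.
    aligned = proj_head(student_feats)                       # (B, C', H, W)
    if aligned.shape[-2:] != frozen_feats.shape[-2:]:
        aligned = F.interpolate(aligned, size=frozen_feats.shape[-2:],
                                mode="bilinear", align_corners=False)
    cos = F.cosine_similarity(aligned, frozen_feats, dim=1)  # (B, H, W)
    # Ignore pixels that are already well aligned (beyond the margin), leaving
    # the depth branch some freedom on fine-grained, non-semantic detail.
    mask = cos < margin
    if mask.sum() == 0:
        return aligned.new_zeros(())
    return (1.0 - cos[mask]).mean()
```

In such a setup, the total objective would combine the labeled depth loss, the pseudo-label loss on unlabeled images, and this auxiliary term with suitable weights, so the depth model inherits semantics without being forced to reproduce the frozen encoder exactly.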