Depth Anything 3: Recovering the Visual Space from Any Views
November 13, 2025
Authors: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang
cs.AI
Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., a vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. On this benchmark, DA3 sets a new state of the art across all tasks, surpassing the prior SOTA, VGGT, by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
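To make the abstract's two design claims concrete (a plain transformer backbone and a single depth-ray prediction target), the following is a minimal, illustrative PyTorch sketch. It is not the authors' DA3 implementation; all module names, layer sizes, and output shapes are assumptions chosen only to show the idea of one unspecialized ViT-style encoder feeding one dense head that predicts per-patch depth plus a unit ray direction.

# Illustrative sketch only (assumed architecture, not DA3's released code).
import torch
import torch.nn as nn

class PlainTransformerBackbone(nn.Module):
    """Stand-in for a vanilla DINO-style ViT encoder, with no task-specific modules."""
    def __init__(self, patch=16, dim=384, depth=6, heads=6):
        super().__init__()
        # Patchify the image and embed each patch as a token.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                               # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                    # (B, dim, H/16, W/16)
        b, c, h, w = tokens.shape
        tokens = tokens.flatten(2).transpose(1, 2)      # (B, N, dim)
        return self.encoder(tokens), (h, w)

class DepthRayHead(nn.Module):
    """Single prediction target: per-patch depth (1 channel) + ray direction (3 channels)."""
    def __init__(self, dim=384):
        super().__init__()
        self.proj = nn.Linear(dim, 4)

    def forward(self, tokens, hw):
        out = self.proj(tokens)                                 # (B, N, 4)
        depth = out[..., :1].exp()                              # positive depth values
        rays = nn.functional.normalize(out[..., 1:], dim=-1)    # unit ray directions
        b, n, _ = out.shape
        h, w = hw
        return depth.view(b, h, w, 1), rays.view(b, h, w, 3)

# Usage: a batch of two views of a scene, processed by the same shared backbone.
backbone, head = PlainTransformerBackbone(), DepthRayHead()
images = torch.randn(2, 3, 224, 224)
tokens, hw = backbone(images)
depth, rays = head(tokens, hw)
print(depth.shape, rays.shape)    # torch.Size([2, 14, 14, 1]) torch.Size([2, 14, 14, 3])

Depth plus a ray per pixel is enough to place points in a common 3D space, which is why a single such target can stand in for the multi-task heads (pose, point maps, depth) used by earlier any-view systems.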