Depth Anything 3: Recovering the Visual Space from Any Views
November 13, 2025
Authors: Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, Bingyi Kang
cs.AI
Abstract
We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
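The two modeling choices named in the abstract, a single plain transformer backbone and a single depth-ray prediction target, can be pictured with a short sketch. The PyTorch code below is a rough illustration only: it assumes a DINO-like ViT whose self-attention runs over the concatenated patch tokens of all input views, and a head that regresses one depth value plus a ray origin and unit direction per token. All class names (PlainViTBackbone, DepthRayHead, DepthAnything3Sketch), hyperparameters, and the exact depth-ray parameterization are assumptions for illustration, not the released DA3 implementation.

```python
# Minimal sketch of the modeling idea described in the abstract.
# Everything here (names, sizes, the depth-ray parameterization) is assumed.
import torch
import torch.nn as nn


class PlainViTBackbone(nn.Module):
    """Plain ViT: patchify every view, then run one stack of standard
    transformer blocks over the concatenated tokens of all views, so
    cross-view consistency comes from ordinary self-attention."""

    def __init__(self, img_size=224, patch=14, dim=384, depth=12, heads=6):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, views):                          # (B, V, 3, H, W)
        b, v = views.shape[:2]
        tok = self.patch_embed(views.flatten(0, 1))    # (B*V, dim, h, w)
        tok = tok.flatten(2).transpose(1, 2) + self.pos_embed   # (B*V, N, dim)
        tok = tok.reshape(b, -1, tok.shape[-1])        # (B, V*N, dim)
        return self.encoder(tok)


class DepthRayHead(nn.Module):
    """Single prediction target: per-token depth plus a camera ray
    (origin and unit direction) instead of several task-specific heads."""

    def __init__(self, dim=384):
        super().__init__()
        self.proj = nn.Linear(dim, 1 + 3 + 3)  # depth | ray origin | ray direction

    def forward(self, tokens):
        out = self.proj(tokens)
        depth = out[..., :1].exp()                     # keep depth positive
        origin = out[..., 1:4]
        direction = nn.functional.normalize(out[..., 4:], dim=-1)
        return depth, origin, direction


class DepthAnything3Sketch(nn.Module):
    """One plain backbone, one depth-ray head, any number of input views."""

    def __init__(self):
        super().__init__()
        self.backbone = PlainViTBackbone()
        self.head = DepthRayHead()

    def forward(self, views):                          # (B, V, 3, H, W)
        b, v = views.shape[:2]
        tokens = self.backbone(views)                  # (B, V*N, dim)
        depth, origin, direction = self.head(tokens)
        n = tokens.shape[1] // v
        return (depth.reshape(b, v, n, 1),
                origin.reshape(b, v, n, 3),
                direction.reshape(b, v, n, 3))


if __name__ == "__main__":
    model = DepthAnything3Sketch()
    frames = torch.randn(1, 4, 3, 224, 224)            # four unposed views
    d, o, r = model(frames)
    print(d.shape, o.shape, r.shape)                   # per-view depth and ray maps
```

Run on four 224x224 views, the sketch returns per-view depth and ray maps without requiring camera poses, which is the kind of pose-free, multi-view-consistent output the abstract describes; the paper's actual head design and training losses may differ.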