DVD: Deterministic Video Depth Estimation with Generative Priors

Abstract

Existing video depth estimation faces a fundamental trade-off: generative models suffer from stochastic geometric hallucinations and scale drift, while discriminative models demand massive labeled datasets to resolve semantic ambiguities. To break this impasse, we present DVD, the first framework to deterministically adapt pre-trained video diffusion models into single-pass depth regressors. DVD has three core designs: (i) repurposing the diffusion timestep as a structural anchor that balances global stability with high-frequency detail; (ii) latent manifold rectification (LMR), which enforces differential constraints to mitigate regression-induced over-smoothing and restore sharp boundaries and coherent motion; and (iii) global affine coherence, an inherent property that bounds inter-window divergence and enables seamless long-video inference without complex temporal alignment. Extensive experiments demonstrate that DVD achieves state-of-the-art zero-shot performance across multiple benchmarks. Furthermore, DVD unlocks the geometric priors implicit in video foundation models using 163x less task-specific data than leading baselines. We release our full training pipeline, giving the open-source community a complete training suite for state-of-the-art video depth estimation.
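To make design (iii) concrete: if predictions from adjacent windows differ only by a global scale and shift, a closed-form least-squares fit on the overlapping frames is enough to chain them. The sketch below illustrates that idea only. The abstract does not describe the actual stitching procedure, so `fit_scale_shift`, `stitch_windows`, the `overlap` parameter, and the flat-list frame representation are all hypothetical choices made for this example.

```python
from statistics import fmean


def fit_scale_shift(src, ref):
    """Return (scale, shift) minimising sum((scale * src_i + shift - ref_i) ** 2)."""
    if len(src) != len(ref) or len(src) < 2:
        raise ValueError("need two equally sized samples with at least 2 values")
    mean_src, mean_ref = fmean(src), fmean(ref)
    var = sum((x - mean_src) ** 2 for x in src)
    if var == 0.0:
        raise ValueError("source values are constant; scale is undefined")
    cov = sum((x - mean_src) * (y - mean_ref) for x, y in zip(src, ref))
    scale = cov / var
    shift = mean_ref - scale * mean_src
    return scale, shift


def stitch_windows(windows, overlap):
    """Chain per-window depth predictions into one sequence.

    windows: list of windows; each window is a list of frames and each
             frame is a flat list of depth values (assumed layout).
    overlap: number of frames shared by consecutive windows (assumed).
    """
    if overlap < 1:
        raise ValueError("consecutive windows must share at least one frame")
    stitched = [list(frame) for frame in windows[0]]
    for window in windows[1:]:
        # Compare the tail of the stitched result with the head of the new window.
        ref = [v for frame in stitched[-overlap:] for v in frame]
        src = [v for frame in window[:overlap] for v in frame]
        scale, shift = fit_scale_shift(src, ref)
        # Map the whole window into the reference frame, then keep the new frames.
        aligned = [[scale * v + shift for v in frame] for frame in window]
        stitched.extend(aligned[overlap:])
    return stitched
```

In practice the same fit would be done on arrays and usually restricted to valid pixels. The point of the sketch is that a two-parameter fit per window boundary suffices when the inter-window discrepancy is affine.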