
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos

January 21, 2025
Authors: Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, Bingyi Kang
cs.AI

Abstract

Depth Anything has achieved remarkable success in monocular depth estimation with strong generalization ability. However, it suffers from temporal inconsistency in videos, hindering its practical applications. Various methods have been proposed to alleviate this issue by leveraging video generation models or introducing priors from optical flow and camera poses. Nonetheless, these methods are only applicable to short videos (< 10 seconds) and require a trade-off between quality and computational efficiency. We propose Video Depth Anything for high-quality, consistent depth estimation in super-long videos (over several minutes) without sacrificing efficiency. We base our model on Depth Anything V2 and replace its head with an efficient spatial-temporal head. We design a straightforward yet effective temporal consistency loss by constraining the temporal depth gradient, eliminating the need for additional geometric priors. The model is trained on a joint dataset of video depth and unlabeled images, similar to Depth Anything V2. Moreover, a novel key-frame-based strategy is developed for long video inference. Experiments show that our model can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Comprehensive evaluations on multiple video benchmarks demonstrate that our approach sets a new state-of-the-art in zero-shot video depth estimation. We offer models of different scales to support a range of scenarios, with our smallest model capable of real-time performance at 30 FPS.
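The abstract mentions a temporal consistency loss that constrains the temporal depth gradient. As a rough illustration only, the PyTorch sketch below penalizes the mismatch between predicted and ground-truth frame-to-frame depth changes; the function name, tensor layout, and the choice of an L1 penalty are assumptions for illustration, not the paper's exact formulation.

```python
import torch

def temporal_gradient_loss(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch of a temporal-gradient consistency loss.

    pred, gt: depth tensors of shape (B, T, H, W), one depth map per frame.
    Penalizes differences between predicted and ground-truth temporal
    depth gradients (depth change between consecutive frames).
    """
    # Temporal gradient: frame-to-frame depth difference, shape (B, T-1, H, W).
    pred_grad = pred[:, 1:] - pred[:, :-1]
    gt_grad = gt[:, 1:] - gt[:, :-1]
    # L1 penalty on the gradient mismatch, averaged over all pixels (an assumption).
    return (pred_grad - gt_grad).abs().mean()
```

The key-frame-based long-video inference is likewise only described at a high level. The sketch below shows a generic overlapping-window stitching scheme, in which each window's depth predictions are aligned to the previous window by a least-squares scale and shift fitted on the shared frames; the helper names and the least-squares alignment are hypothetical stand-ins, and the paper's actual key-frame strategy may differ.

```python
import numpy as np

def align_scale_shift(prev_overlap: np.ndarray, curr_overlap: np.ndarray):
    """Fit scale s and shift t so that s * curr + t ≈ prev on shared frames."""
    x = curr_overlap.reshape(-1)
    y = prev_overlap.reshape(-1)
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix (N, 2)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

def stitch_windows(windows: list[np.ndarray], overlap: int) -> np.ndarray:
    """Merge per-window depth predictions (each of shape (T, H, W)) into one
    sequence, aligning each window to the previous one on `overlap` shared frames."""
    out = [windows[0]]
    for w in windows[1:]:
        s, t = align_scale_shift(out[-1][-overlap:], w[:overlap])
        w = s * w + t              # bring this window into the running scale
        out.append(w[overlap:])    # drop the duplicated overlap frames
    return np.concatenate(out, axis=0)
```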

