Learning Temporally Consistent Video Depth from Video Diffusion Priors
June 3, 2024
Authors: Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Matteo Poggi, Yiyi Liao
cs.AI
Abstract
This work addresses the challenge of video depth estimation, which expects
not only per-frame accuracy but, more importantly, cross-frame consistency.
Instead of directly developing a depth estimator from scratch, we reformulate
the prediction task into a conditional generation problem. This allows us to
leverage the prior knowledge embedded in existing video generation models,
thereby reducing learning difficulty and enhancing generalizability.
Concretely, we study how to tame the public Stable Video Diffusion (SVD) to
predict reliable depth from input videos using a mixture of image depth and
video depth datasets. We empirically confirm that a procedural training
strategy - first optimizing the spatial layers of SVD and then optimizing the
temporal layers while keeping the spatial layers frozen - yields the best
results in terms of both spatial accuracy and temporal consistency. We further
examine the sliding window strategy for inference on arbitrarily long videos.
Our observations indicate a trade-off between efficiency and performance, with
a one-frame overlap already producing favorable results. Extensive experimental
results demonstrate the superiority of our approach, termed ChronoDepth, over
existing alternatives, particularly in terms of the temporal consistency of the
estimated depth. Additionally, we highlight the benefits of more consistent
video depth in two practical applications: depth-conditioned video generation
and novel view synthesis. Our project page is available at
https://jhaoshao.github.io/ChronoDepth/.
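The sliding-window inference described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `predict_depth_clip` is a hypothetical stand-in for the per-clip depth predictor, and the clip length and overlap values are illustrative defaults.

```python
# Hedged sketch of sliding-window video-depth inference with an
# `overlap`-frame overlap between consecutive clips, as in the abstract.
# `predict_depth_clip` is a placeholder for a per-clip depth model.

def sliding_window_depth(frames, clip_len=8, overlap=1, predict_depth_clip=None):
    """Run a per-clip depth predictor over an arbitrarily long frame list,
    re-processing `overlap` frames between clips to keep them consistent."""
    if predict_depth_clip is None:
        # Placeholder predictor: returns one "depth map" per input frame.
        predict_depth_clip = lambda clip: list(clip)
    depths = []
    start = 0
    while start < len(frames):
        clip_depths = predict_depth_clip(frames[start:start + clip_len])
        # Drop the overlapped frames already emitted by the previous clip.
        skip = overlap if start > 0 else 0
        depths.extend(clip_depths[skip:])
        if start + clip_len >= len(frames):
            break
        start += clip_len - overlap
    return depths
```

A larger overlap re-processes more frames per window, which is the efficiency/performance trade-off the abstract notes; with `overlap=1` only the boundary frame is shared between neighboring clips.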