DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

September 3, 2024
Authors: Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, Ying Shan
cs.AI

Abstract

Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length. We present DepthCrafter, an innovative method for generating temporally consistent long depth sequences with intricate details for open-world videos, without requiring any supplementary information such as camera poses or optical flow. DepthCrafter generalizes to open-world videos by training a video-to-depth model from a pre-trained image-to-video diffusion model, through a meticulously designed three-stage training strategy on compiled paired video-depth datasets. Our training approach enables the model to generate variable-length depth sequences of up to 110 frames in a single pass, and to harvest both precise depth details and rich content diversity from realistic and synthetic datasets. We also propose an inference strategy that handles extremely long videos through segment-wise estimation and seamless stitching. Comprehensive evaluations on multiple datasets show that DepthCrafter achieves state-of-the-art performance in open-world video depth estimation under zero-shot settings. Furthermore, DepthCrafter facilitates various downstream applications, including depth-based visual effects and conditional video generation.
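
The segment-wise inference idea from the abstract can be sketched in a few lines: run the depth model on overlapping windows, align each new window to the sequence estimated so far, and blend across the overlap. The sketch below is a minimal illustration of that scheme under stated assumptions, not DepthCrafter's actual implementation; `estimate_segment`, the window length, and the overlap size are hypothetical placeholders, and the least-squares scale/shift alignment is one plausible way to make neighboring segments agree.

```python
import numpy as np

def estimate_depth_segmentwise(frames, estimate_segment, seg_len=110, overlap=25):
    """Split a long video into overlapping segments, estimate depth per
    segment, and stitch the results into one consistent sequence.

    `estimate_segment` is a placeholder for the per-segment model call:
    it maps a list of frames to an (N, H, W) depth array.
    """
    n = len(frames)
    if n <= seg_len:
        return estimate_segment(frames)

    stride = seg_len - overlap
    starts = list(range(0, n - seg_len + 1, stride))
    if starts[-1] + seg_len < n:          # make sure the tail is covered
        starts.append(n - seg_len)

    stitched = estimate_segment(frames[:seg_len])
    for s in starts[1:]:
        seg = estimate_segment(frames[s:s + seg_len])
        ov = stitched.shape[0] - s        # frames shared with the result so far

        # Align the new segment to the existing sequence with a scale/shift
        # fitted on the overlapping frames (least squares).
        a, b = np.polyfit(seg[:ov].ravel(), stitched[-ov:].ravel(), 1)
        seg = a * seg + b

        # Linearly blend across the overlap for a seamless transition.
        w = np.linspace(0.0, 1.0, ov)[:, None, None]
        stitched[-ov:] = (1 - w) * stitched[-ov:] + w * seg[:ov]
        stitched = np.concatenate([stitched, seg[ov:]], axis=0)
    return stitched
```

The affine alignment step matters because independently estimated segments are typically consistent only up to an unknown scale and shift, so matching them on the overlap before blending avoids visible depth jumps at segment boundaries.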
