Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model
March 28, 2025
Authors: Jangho Park, Taesung Kwon, Jong Chul Ye
cs.AI
Abstract
Recently, multi-view or 4D video generation has emerged as a significant
research topic. Nonetheless, recent approaches to 4D generation still struggle
with fundamental limitations, as they primarily rely on harnessing multiple
video diffusion models with additional training or compute-intensive training
of a full 4D diffusion model with limited real-world 4D data and large
computational costs. To address these challenges, we propose the first
training-free 4D video generation method that leverages off-the-shelf video
diffusion models to generate multi-view videos from a single input video. Our
approach consists of two key steps: (1) By designating the edge frames in the
spatio-temporal sampling grid as key frames, we first synthesize them using a
video diffusion model, leveraging a depth-based warping technique for guidance.
This approach ensures structural consistency across the generated frames,
preserving spatial and temporal coherence. (2) We then interpolate the
remaining frames using a video diffusion model, constructing a fully populated
and temporally coherent sampling grid while preserving spatial and temporal
consistency. Through this approach, we extend a single video into a multi-view
video along novel camera trajectories while maintaining spatio-temporal
consistency. Our method is training-free and fully utilizes an off-the-shelf
video diffusion model, offering a practical and effective solution for
multi-view video generation.
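The two-step structure described in the abstract can be illustrated with a small sketch of the spatio-temporal sampling grid. This is not the authors' implementation: the function name `split_sampling_grid`, the (view, frame) index layout, and the grid size are assumptions made purely for illustration. It only shows which grid cells would count as edge (key) frames for step (1) and which interior cells would be left for interpolation in step (2); the diffusion sampling and depth-based warping are not modeled.

```python
def split_sampling_grid(num_views, num_frames):
    """Split a views x frames grid into edge cells and interior cells.

    Hypothetical helper: a cell (v, t) stands for camera view v at time t.
    Edge cells play the role of key frames (step 1); interior cells are
    the remaining frames to be interpolated (step 2).
    """
    key_cells, interior_cells = [], []
    for v in range(num_views):
        for t in range(num_frames):
            # A cell is on the edge if it lies in the first or last
            # row (view) or the first or last column (time step).
            on_edge = v in (0, num_views - 1) or t in (0, num_frames - 1)
            (key_cells if on_edge else interior_cells).append((v, t))
    return key_cells, interior_cells


# Example: an assumed grid of 4 camera views and 5 time steps.
key_cells, interior_cells = split_sampling_grid(4, 5)
print(len(key_cells), len(interior_cells))  # 14 6
```

In this toy grid, 14 of the 20 cells lie on the border and 6 lie in the interior, so most of the structure is fixed by the key frames before any interpolation takes place.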