VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control
July 17, 2024
作者: Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
cs.AI
Abstract
Modern text-to-video synthesis models demonstrate coherent, photorealistic
generation of complex videos from a text description. However, most existing
models lack fine-grained control over camera movement, which is critical for
downstream applications related to content creation, visual effects, and 3D
vision. Recently, new methods demonstrate the ability to generate videos with
controllable camera poses; these techniques leverage pre-trained U-Net-based
diffusion models that explicitly disentangle spatial and temporal generation.
Still, no existing approach enables camera control for new, transformer-based
video diffusion models that process spatial and temporal information jointly.
Here, we propose to tame video transformers for 3D camera control using a
ControlNet-like conditioning mechanism that incorporates spatiotemporal camera
embeddings based on Plücker coordinates. The approach demonstrates
state-of-the-art performance for controllable video generation after
fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our
work is the first to enable camera control for transformer-based video
diffusion models.
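
The camera conditioning described in the abstract relies on per-pixel Plücker ray embeddings derived from the camera poses. As a rough, illustrative sketch of how such an embedding can be computed from intrinsics and extrinsics (the function name, tensor shapes, and coordinate conventions below are assumptions for illustration, not the paper's released implementation):

```python
import torch

def plucker_embedding(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Per-pixel Plücker ray embedding for a single camera.

    K:   (3, 3) camera intrinsics.
    c2w: (4, 4) camera-to-world extrinsics.
    Returns a (6, H, W) tensor holding (o x d, d) for each pixel's ray.
    """
    device = K.device
    # Pixel grid sampled at pixel centers.
    v, u = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32) + 0.5,
        torch.arange(W, device=device, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)      # (H, W, 3)

    # Ray directions: unproject with K^-1, rotate into world frame, normalize.
    dirs_cam = pix @ torch.linalg.inv(K).T                      # (H, W, 3)
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    # Ray origin is the camera center, shared by every pixel.
    origin = c2w[:3, 3].expand_as(dirs_world)                   # (H, W, 3)

    # Plücker coordinates: moment (o x d) concatenated with direction d.
    moment = torch.cross(origin, dirs_world, dim=-1)
    plucker = torch.cat([moment, dirs_world], dim=-1)           # (H, W, 6)
    return plucker.permute(2, 0, 1)                             # (6, H, W)
```

For a clip of T frames, per-frame maps of this kind can be stacked into a (T, 6, H, W) tensor and fed to a ControlNet-like conditioning branch alongside the video latents; how the embedding is injected into the transformer is specific to the paper and not reproduced here.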