

VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

July 17, 2024
作者: Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
cs.AI

Abstract

Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications related to content creation, visual effects, and 3D vision. Recently, new methods have demonstrated the ability to generate videos with controllable camera poses; these techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates. The approach demonstrates state-of-the-art performance for controllable video generation after fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our work is the first to enable camera control for transformer-based video diffusion models.
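
The abstract conditions the video transformer on spatiotemporal camera embeddings built from Plücker coordinates. Below is a minimal sketch of how such per-pixel Plücker ray embeddings are commonly computed from camera intrinsics and extrinsics; it illustrates the general technique rather than the authors' implementation, and the function name `plucker_embedding` and its argument layout are assumptions.

```python
import torch

def plucker_embedding(K, c2w, height, width):
    """Per-pixel Plücker ray embedding (6 channels) for a single camera.

    K   : (3, 3) camera intrinsics
    c2w : (4, 4) camera-to-world extrinsics
    Returns a (6, height, width) tensor holding (moment, direction) per pixel.
    """
    # Pixel-center grid in image coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32) + 0.5,
        torch.arange(width, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Back-project pixels to camera-space ray directions.
    dirs_cam = torch.stack(
        [(xs - K[0, 2]) / K[0, 0], (ys - K[1, 2]) / K[1, 1], torch.ones_like(xs)],
        dim=-1,
    )  # (H, W, 3)
    # Rotate directions into world space and normalize.
    dirs_world = dirs_cam @ c2w[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)
    # Ray origin is the camera center; the Plücker moment is m = o x d.
    origin = c2w[:3, 3].expand_as(dirs_world)
    moment = torch.cross(origin, dirs_world, dim=-1)
    # Concatenate (m, d) and move channels first: (6, H, W).
    return torch.cat([moment, dirs_world], dim=-1).permute(2, 0, 1)
```

Computing this map for every frame of a clip and stacking the results along the time axis yields a spatiotemporal conditioning volume that a ControlNet-like branch can consume alongside the video latents; the exact way the paper injects it into the transformer is described in the full text.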

