Training-free Camera Control for Video Generation
June 14, 2024
Authors: Chen Hou, Guoqiang Wei, Yan Zeng, Zhibo Chen
cs.AI
Abstract
We propose a training-free and robust solution that adds camera movement
control to off-the-shelf video diffusion models. Unlike previous work, our
method requires neither supervised finetuning on camera-annotated datasets
nor self-supervised training via data augmentation. Instead, it works in a
plug-and-play manner with most pretrained video diffusion models and generates
camera-controllable videos from a single image or text prompt. Our work is
inspired by the layout prior that intermediate latents hold over the generated
results: rearranging the noisy pixels in these latents reallocates the output
content accordingly. Since camera movement can also be seen as a kind of pixel
rearrangement caused by perspective change, videos can be reorganized to
follow a specific camera motion if their noisy latents change accordingly.
Building on this, we propose CamTrol, which enables robust camera control for
video diffusion models through a two-stage process. First, we model image
layout rearrangement through explicit camera movement in 3D point cloud space.
Second, we generate videos with camera motion using the layout prior of noisy
latents formed from a series of rearranged images. Extensive experiments
demonstrate the robustness of our method in controlling the camera motion of
generated videos. Furthermore, we show that our method can produce impressive
results in generating 3D rotation videos with dynamic content. Project page:
https://lifedecoder.github.io/CamTrol/
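
The two stages described in the abstract can be illustrated with a toy sketch.
This is not the authors' implementation. It assumes a pinhole camera with a
known per-pixel depth for the point cloud step, and a standard DDPM-style
forward process for the noising step; the abstract specifies neither. The
function names and parameters (reproject_pixel, add_noise, focal, cx, cy,
alpha_bar_t) are hypothetical and chosen only for illustration.

```python
import math
import random


def reproject_pixel(u, v, depth, focal, cx, cy,
                    yaw_deg=0.0, translation=(0.0, 0.0, 0.0)):
    """Lift one pixel to 3D, apply a camera motion, and project it back.

    Assumption: pinhole intrinsics (focal, cx, cy) and a rigid transform
    X' = R_y(yaw) X + t that maps points into the moved camera's frame.
    Returns the new pixel position, or None if the point ends up behind
    the camera.
    """
    # Unproject the pixel into camera coordinates using its depth.
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    z = depth

    # Rotate about the vertical (y) axis, then translate.
    theta = math.radians(yaw_deg)
    tx, ty, tz = translation
    x_new = math.cos(theta) * x + math.sin(theta) * z + tx
    y_new = y + ty
    z_new = -math.sin(theta) * x + math.cos(theta) * z + tz

    if z_new <= 0.0:
        return None  # the point is no longer in front of the camera

    # Project back onto the image plane of the moved camera.
    return (focal * x_new / z_new + cx, focal * y_new / z_new + cy)


def add_noise(latent, alpha_bar_t, rng):
    """Diffuse a clean latent to an intermediate noise level.

    Assumption: the standard DDPM forward process
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    applied here to a flat list of floats standing in for a latent.
    """
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * value + noise_scale * rng.gauss(0.0, 1.0)
            for value in latent]


if __name__ == "__main__":
    # A pixel at depth 2.0 shifts horizontally when the scene points are
    # translated by 0.1 units along x in the new camera frame.
    print(reproject_pixel(100, 50, 2.0, 500.0, 128.0, 128.0,
                          translation=(0.1, 0.0, 0.0)))

    # A partially noised latent keeps part of the original layout signal.
    rng = random.Random(0)
    print(add_noise([0.5, -0.2, 1.0], alpha_bar_t=0.5, rng=rng))
```

In this sketch, applying reproject_pixel to every pixel of an input image for
each pose along a camera trajectory would give a series of rearranged images,
and add_noise stands in for turning their latents into noisy latents that a
video diffusion model then denoises. Hole filling, latent encoding, and the
choice of noise level are left out.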