MotionMaster: Training-free Camera Motion Transfer For Video Generation
April 24, 2024
Authors: Teng Hu, Jiangning Zhang, Ran Yi, Yating Wang, Hongrui Huang, Jieyu Weng, Yabiao Wang, Lizhuang Ma
cs.AI
Abstract
The emergence of diffusion models has greatly propelled the progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module, and necessitate substantial computational resources due to the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving-object regions based on the motion in the background by solving a Poisson equation.
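To make the Poisson-equation step concrete, the following minimal sketch (our own illustration, not the authors' code; all names are hypothetical) fills the motion field inside the moving-object mask by solving the discrete Laplace equation, i.e. a Poisson equation with zero right-hand side, via Jacobi iteration, with the background motion acting as the boundary condition:

```python
import numpy as np

def inpaint_camera_motion(motion, mask, n_iters=2000):
    """Fill in camera motion under moving objects (hypothetical helper).

    motion : (H, W, 2) per-pixel motion field; pixels outside `mask`
             carry background (pure camera) motion.
    mask   : (H, W) boolean array, True where moving objects hide the
             background and camera motion must be estimated.

    Jacobi iteration for the Laplace equation: each masked pixel is
    repeatedly replaced by the mean of its 4 neighbours, so the result
    is smooth inside the mask and matches the background at its border.
    """
    field = motion.astype(np.float64).copy()
    field[mask] = 0.0  # initial guess inside the masked region
    for _ in range(n_iters):
        # 4-neighbour average (image borders wrap here, for brevity).
        avg = 0.25 * (np.roll(field, 1, 0) + np.roll(field, -1, 0)
                      + np.roll(field, 1, 1) + np.roll(field, -1, 1))
        field[mask] = avg[mask]  # update only the unknown pixels
    return field
```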
Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in the temporal attention maps of multiple videos.
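As a rough picture of this window-based clustering step, the sketch below (the tensor shapes, window size, and the k-means choice are our assumptions, not the paper's implementation) clusters per-video features within each spatial window of the temporal attention maps and keeps the dominant cluster's centroid as the shared camera motion:

```python
import numpy as np
from sklearn.cluster import KMeans

def common_camera_motion(attn_maps, window=8, n_clusters=2):
    """Extract the shared component of several temporal attention maps.

    attn_maps : (N, H, W, C) array, one temporal attention map per
                source video (C = flattened frame-to-frame weights).

    For each non-overlapping spatial window, the N per-video window
    features are clustered, and the centroid of the largest cluster is
    kept as the common (camera-induced) motion for that window.
    """
    n, h, w, c = attn_maps.shape
    assert h % window == 0 and w % window == 0, "sketch assumes aligned windows"
    out = np.empty((h, w, c))
    for y in range(0, h, window):
        for x in range(0, w, window):
            # One feature vector per video for this spatial window.
            feats = attn_maps[:, y:y + window, x:x + window, :].reshape(n, -1)
            km = KMeans(n_clusters=min(n_clusters, n), n_init=10).fit(feats)
            # The most populated cluster is taken as the shared motion.
            top = np.bincount(km.labels_).argmax()
            out[y:y + window, x:x + window, :] = \
                km.cluster_centers_[top].reshape(window, window, c)
    return out
```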
Finally, we propose a motion combination method to combine different types of camera motions, endowing our model with more controllable and flexible camera control.
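Motion combination could plausibly be realized as a masked or weighted sum of disentangled camera-motion representations; the sketch below is purely illustrative and every name in it is hypothetical:

```python
import numpy as np

def combine_camera_motions(motions, weights=None, masks=None):
    """Blend several disentangled camera motions into one field.

    motions : list of (H, W, 2) camera-motion fields (e.g. pan, zoom).
    weights : optional per-motion scalars for a global weighted sum.
    masks   : optional list of (H, W) boolean arrays restricting each
              motion to a region (e.g. zoom on the left, pan on the right).
    """
    if weights is None:
        weights = [1.0] * len(motions)
    combined = np.zeros_like(motions[0], dtype=np.float64)
    for i, m in enumerate(motions):
        part = weights[i] * np.asarray(m, dtype=np.float64)
        if masks is not None:
            part = part * masks[i][..., None]  # zero outside the region
        combined += part
    return combined
```

For instance, combine_camera_motions([pan, zoom], masks=[left_half, right_half]) would apply the pan on the left half of the frame and the zoom on the right, matching the abstract's claim of flexible combined control.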
Extensive experiments demonstrate that our training-free approach can effectively decouple camera-object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.