MotionClone: Training-Free Motion Cloning for Controllable Video Generation

June 8, 2024
Authors: Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin
cs.AI

Abstract

Motion-based controllable text-to-video generation uses motion cues to control video generation. Previous methods typically require the training of models to encode motion cues or the fine-tuning of video diffusion models. However, these approaches often result in suboptimal motion generation when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that enables motion cloning from a reference video to control text-to-video generation. We employ temporal attention in video inversion to represent the motions in the reference video and introduce primary temporal-attention guidance to mitigate the influence of noisy or very subtle motions within the attention weights. Furthermore, to assist the generation model in synthesizing reasonable spatial relationships and enhance its prompt-following capability, we propose a location-aware semantic guidance mechanism that leverages the coarse location of the foreground from the reference video and original classifier-free guidance features to guide the video generation. Extensive experiments demonstrate that MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.
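
The abstract describes primary temporal-attention guidance only at a high level. As a rough illustration of the idea, a guidance loss that keeps only the dominant temporal-attention components might be sketched as below. This is a hedged sketch, not the authors' released code: the function name `primary_temporal_attention_loss`, the tensor shapes, the `top_k` masking, and the L2 objective are assumptions made for illustration.

```python
# Minimal sketch of primary temporal-attention guidance (illustrative only).
# Assumes temporal-attention weight tensors of shape (batch, heads, frames, frames):
# `ref_attn` from inverting the reference video, `gen_attn` from the current
# denoising step of the generated video.
import torch
import torch.nn.functional as F

def primary_temporal_attention_loss(gen_attn: torch.Tensor,
                                    ref_attn: torch.Tensor,
                                    top_k: int = 1) -> torch.Tensor:
    """Align only the dominant temporal-attention components.

    A binary mask keeps the top-k reference weights along the key-frame axis,
    so noisy or very subtle motions contribute no gradient.
    """
    _, idx = ref_attn.topk(top_k, dim=-1)                  # dominant components
    mask = torch.zeros_like(ref_attn).scatter_(-1, idx, 1.0)
    return F.mse_loss(gen_attn * mask, ref_attn * mask)    # masked L2 distance

# During sampling, the gradient of this loss with respect to the noisy latent
# could be used, classifier-guidance style, to steer the generated video's
# motion toward that of the reference.
```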
