MotionClone: Training-Free Motion Cloning for Controllable Video Generation
June 8, 2024
Authors: Pengyang Ling, Jiazi Bu, Pan Zhang, Xiaoyi Dong, Yuhang Zang, Tong Wu, Huaian Chen, Jiaqi Wang, Yi Jin
cs.AI
Abstract
Motion-based controllable text-to-video generation uses motion signals to
control the generated video. Previous methods typically require the training
of models to encode motion cues or the fine-tuning of video diffusion models.
However, these approaches often result in suboptimal motion generation when
applied outside the trained domain. In this work, we propose MotionClone, a
training-free framework that enables motion cloning from a reference video to
control text-to-video generation. We employ temporal attention in video
inversion to represent the motions in the reference video and introduce primary
temporal-attention guidance to mitigate the influence of noisy or very subtle
motions within the attention weights. Furthermore, to assist the generation
model in synthesizing reasonable spatial relationships and enhance its
prompt-following capability, we propose a location-aware semantic guidance
mechanism that leverages the coarse location of the foreground from the
reference video and original classifier-free guidance features to guide the
video generation. Extensive experiments demonstrate that MotionClone exhibits
proficiency in both global camera motion and local object motion, with notable
superiority in terms of motion fidelity, textual alignment, and temporal
consistency.
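The abstract describes two training-free guidance signals: primary temporal-attention guidance and location-aware semantic guidance. Below is a minimal PyTorch sketch of how such guidance losses could be formed and combined during sampling. The function names, tensor shapes, top-k masking, and foreground-mask handling are illustrative assumptions, not the paper's reference implementation.

```python
import torch


def primary_temporal_attention_loss(attn_gen, attn_ref, k=1):
    """Sketch of primary temporal-attention guidance (shapes are assumptions).

    attn_gen, attn_ref: [batch*heads, spatial_tokens, frames, frames]
    Only the k largest ("primary") reference attention weights per query frame
    contribute, so noisy or very subtle motions in the remaining weights are ignored.
    """
    topk = attn_ref.topk(k, dim=-1).indices
    mask = torch.zeros_like(attn_ref).scatter_(-1, topk, 1.0)
    diff = (attn_gen - attn_ref) * mask
    return (diff ** 2).sum() / mask.sum().clamp(min=1.0)


def location_aware_semantic_loss(feat_gen, feat_cfg, fg_mask):
    """Sketch of location-aware semantic guidance (shapes are assumptions).

    feat_gen, feat_cfg: [batch, channels, height, width]
    fg_mask:            [batch, 1, height, width], coarse foreground location
                        taken from the reference video
    Keeps the generated features close to the original classifier-free-guidance
    features inside the coarse foreground region.
    """
    diff = (feat_gen - feat_cfg) * fg_mask
    return (diff ** 2).sum() / fg_mask.sum().clamp(min=1.0)


if __name__ == "__main__":
    # Toy tensors standing in for temporal attention maps and U-Net features.
    bh, tokens, frames = 2, 64, 16
    attn_ref = torch.softmax(torch.randn(bh, tokens, frames, frames), dim=-1)
    attn_gen = torch.softmax(
        torch.randn(bh, tokens, frames, frames, requires_grad=True), dim=-1
    )
    feat_cfg = torch.randn(2, 320, 32, 32)
    feat_gen = torch.randn(2, 320, 32, 32, requires_grad=True)
    fg_mask = (torch.rand(2, 1, 32, 32) > 0.5).float()

    # A weighted sum of the two guidance terms; in a sampler, its gradient
    # with respect to the noisy latent would steer each denoising step.
    loss = primary_temporal_attention_loss(attn_gen, attn_ref) \
        + 0.1 * location_aware_semantic_loss(feat_gen, feat_cfg, fg_mask)
    loss.backward()
    print(float(loss))
```

In this sketch the two terms are simply added with a hand-picked weight; how the guidance gradient is injected into the diffusion sampler (and at which timesteps or U-Net layers) follows the paper, which this snippet does not reproduce.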