MotionClone: Training-Free Motion Cloning for Controllable Video Generation

Abstract

Motion-based controllable text-to-video generation uses motion cues to control video generation. Previous methods typically require training models to encode motion cues or fine-tuning video diffusion models. However, these approaches often produce suboptimal motion when applied outside the trained domain. In this work, we propose MotionClone, a training-free framework that clones motion from a reference video to control text-to-video generation. We employ temporal attention in video inversion to represent the motions in the reference video, and we introduce primary temporal-attention guidance to mitigate the influence of noisy or very subtle motions within the attention weights. Furthermore, to help the generation model synthesize reasonable spatial relationships and to enhance its prompt-following capability, we propose a location-aware semantic guidance mechanism that leverages the coarse location of the foreground in the reference video, together with the original classifier-free guidance features, to guide video generation. Extensive experiments demonstrate that MotionClone is proficient in both global camera motion and local object motion, with notable superiority in motion fidelity, textual alignment, and temporal consistency.
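
The idea of guiding generation with only the primary components of the temporal attention can be illustrated with a small sketch. The abstract does not say how the primary components are selected or how the guidance signal is computed, so everything below is an assumption made for illustration: the top-k selection per attention row, the squared-error loss, and the hypothetical helper names `top_k_mask` and `masked_attention_loss`. It is not the authors' implementation.

```python
from typing import List

Matrix = List[List[float]]  # one temporal-attention map: frames x frames


def top_k_mask(attn: Matrix, k: int = 1) -> List[List[int]]:
    """Mark the k largest weights in each row with 1, everything else with 0.

    Assumption: "primary" components are the top-k weights per row.
    """
    mask = []
    for row in attn:
        # Indices of the k largest entries in this row.
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        mask.append([1 if j in keep else 0 for j in range(len(row))])
    return mask


def masked_attention_loss(ref: Matrix, gen: Matrix, k: int = 1) -> float:
    """Squared error between ref and gen, restricted to ref's top-k entries.

    Assumption: small or noisy reference weights are simply ignored.
    """
    mask = top_k_mask(ref, k)
    total = 0.0
    for ref_row, gen_row, mask_row in zip(ref, gen, mask):
        for r, g, m in zip(ref_row, gen_row, mask_row):
            total += m * (r - g) ** 2
    return total


# Toy 3-frame example: each row sums to 1, as softmax attention weights would.
reference = [[0.7, 0.2, 0.1],
             [0.1, 0.8, 0.1],
             [0.2, 0.2, 0.6]]
generated = [[0.5, 0.3, 0.2],
             [0.1, 0.8, 0.1],
             [0.3, 0.3, 0.4]]

loss = masked_attention_loss(reference, generated, k=1)
```

In a training-free setting, a loss of this kind would presumably be differentiated with respect to the noisy latent at each denoising step to steer sampling. That step needs a differentiable video diffusion model and is not shown here.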
