Customizing Motion in Text-to-Video Diffusion Models
December 7, 2023
Authors: Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell
cs.AI
Abstract
We introduce an approach for augmenting text-to-video generation models with
customized motions, extending their capabilities beyond the motions depicted in
the original training data. By leveraging a few video samples demonstrating
specific movements as input, our method learns and generalizes the input motion
patterns for diverse, text-specified scenarios. Our contributions are
threefold. First, to achieve our results, we finetune an existing text-to-video
model to learn a novel mapping between the motion depicted in the input
examples and a new unique token. To avoid overfitting to the new custom motion,
we introduce an approach for regularization over videos. Second, by leveraging
the motion priors in a pretrained model, our method can produce novel videos
featuring multiple people doing the custom motion, and can invoke the motion in
combination with other motions. Furthermore, our approach extends to the
multimodal customization of motion and appearance of individualized subjects,
enabling the generation of videos featuring unique characters and distinct
motions. Third, to validate our method, we introduce an approach for
quantitatively evaluating the learned custom motion and perform a systematic
ablation study. We show that our method significantly outperforms prior
appearance-based customization approaches when extended to the motion
customization task.
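
To make the fine-tuning idea concrete, below is a minimal, runnable sketch of binding a new rare text token to a custom motion by training a denoiser on exemplar clips while mixing in regularization videos, as the abstract describes. The toy model, the token id, and all names here are hypothetical stand-ins for illustration, not the authors' architecture or code.

```python
# Sketch: DreamBooth-style motion customization with a new text token.
# ToyTextToVideo and the linear noise schedule are toy stand-ins (assumptions),
# not the paper's actual text-to-video diffusion model.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_TIMESTEPS = 1000
MOTION_TOKEN_ID = 49407  # hypothetical id reserved for the new "<new-motion>" token

class ToyTextToVideo(nn.Module):
    """Toy latent video denoiser: one 3D conv over (B, C, T, H, W) latents."""
    def __init__(self, channels=4, text_dim=32, vocab=49408):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, text_dim)
        self.text_proj = nn.Linear(text_dim, 1)
        self.denoiser = nn.Conv3d(channels + 1, channels, 3, padding=1)

    def forward(self, noisy_latents, t, token_ids):
        # Crude conditioning for the sketch: pooled text embedding plus timestep,
        # broadcast as an extra input channel to the denoiser.
        text = self.text_embed(token_ids).mean(dim=1)            # (B, text_dim)
        cond = self.text_proj(text)[:, :, None, None, None]      # (B, 1, 1, 1, 1)
        cond = cond + t[:, None, None, None, None].float() / NUM_TIMESTEPS
        cond = cond.expand(-1, 1, *noisy_latents.shape[2:])
        return self.denoiser(torch.cat([noisy_latents, cond], dim=1))

def add_noise(latents, noise, t):
    """Toy linear schedule: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps."""
    alpha = (1.0 - t.float() / NUM_TIMESTEPS)[:, None, None, None, None]
    return alpha.sqrt() * latents + (1 - alpha).sqrt() * noise

model = ToyTextToVideo()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def step(latents, token_ids):
    """One denoising fine-tuning step with the standard epsilon-prediction loss."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, NUM_TIMESTEPS, (latents.shape[0],))
    pred = model(add_noise(latents, noise, t), t, token_ids)
    loss = F.mse_loss(pred, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Alternate exemplar clips (captions containing the new motion token) with
# regularization clips (ordinary captions), so the model binds the token to
# the motion without overfitting or forgetting its pretrained motion priors.
exemplar = torch.randn(1, 4, 8, 16, 16)            # toy video latents
exemplar_caption = torch.tensor([[101, MOTION_TOKEN_ID, 102]])
reg_clip = torch.randn(1, 4, 8, 16, 16)
reg_caption = torch.tensor([[101, 2, 102]])
for _ in range(2):
    step(exemplar, exemplar_caption)
    step(reg_clip, reg_caption)
```

The alternation between exemplar and regularization batches is the key design choice suggested by the abstract: it plays the role of the paper's video-based regularization, keeping the pretrained motion priors available so the learned token can later be composed with other subjects and motions.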