Customizing Motion in Text-to-Video Diffusion Models
December 7, 2023
Authors: Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell
cs.AI
Abstract
We introduce an approach for augmenting text-to-video generation models with
customized motions, extending their capabilities beyond the motions depicted in
the original training data. By leveraging a few video samples demonstrating
specific movements as input, our method learns and generalizes the input motion
patterns for diverse, text-specified scenarios. Our contributions are
threefold. First, to achieve our results, we finetune an existing text-to-video
model to learn a novel mapping between the motion depicted in the input
examples and a new unique token. To avoid overfitting to the new custom motion,
we introduce an approach for regularization over videos. Second, by leveraging
the motion priors in a pretrained model, our method can produce novel videos
featuring multiple people doing the custom motion, and can invoke the motion in
combination with other motions. Furthermore, our approach extends to the
multimodal customization of motion and appearance of individualized subjects,
enabling the generation of videos featuring unique characters and distinct
motions. Third, to validate our method, we introduce an approach for
quantitatively evaluating the learned custom motion and perform a systematic
ablation study. We show that our method significantly outperforms prior
appearance-based customization approaches when extended to the motion
customization task.
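
To make the fine-tuning idea concrete, below is a minimal, runnable sketch of binding a new rare text token to a custom motion by training a denoiser on exemplar clips while mixing in regularization videos, as the abstract describes. The toy model, the token id, and all names here are hypothetical stand-ins for illustration, not the authors' architecture or code.

```python
# Sketch: DreamBooth-style motion customization with a new text token.
# ToyTextToVideo and the linear noise schedule are toy stand-ins (assumptions),
# not the paper's actual text-to-video diffusion model.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_TIMESTEPS = 1000
MOTION_TOKEN_ID = 49407  # hypothetical id reserved for the new "<new-motion>" token

class ToyTextToVideo(nn.Module):
    """Toy latent video denoiser: one 3D conv over (B, C, T, H, W) latents."""
    def __init__(self, channels=4, text_dim=32, vocab=49408):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, text_dim)
        self.text_proj = nn.Linear(text_dim, 1)
        self.denoiser = nn.Conv3d(channels + 1, channels, 3, padding=1)

    def forward(self, noisy_latents, t, token_ids):
        # Crude conditioning for the sketch: pooled text embedding plus timestep,
        # broadcast as an extra input channel to the denoiser.
        text = self.text_embed(token_ids).mean(dim=1)            # (B, text_dim)
        cond = self.text_proj(text)[:, :, None, None, None]      # (B, 1, 1, 1, 1)
        cond = cond + t[:, None, None, None, None].float() / NUM_TIMESTEPS
        cond = cond.expand(-1, 1, *noisy_latents.shape[2:])
        return self.denoiser(torch.cat([noisy_latents, cond], dim=1))

def add_noise(latents, noise, t):
    """Toy linear schedule: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps."""
    alpha = (1.0 - t.float() / NUM_TIMESTEPS)[:, None, None, None, None]
    return alpha.sqrt() * latents + (1 - alpha).sqrt() * noise

model = ToyTextToVideo()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def step(latents, token_ids):
    """One denoising fine-tuning step with the standard epsilon-prediction loss."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, NUM_TIMESTEPS, (latents.shape[0],))
    pred = model(add_noise(latents, noise, t), t, token_ids)
    loss = F.mse_loss(pred, noise)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Alternate exemplar clips (captions containing the new motion token) with
# regularization clips (ordinary captions), so the model binds the token to
# the motion without overfitting or forgetting its pretrained motion priors.
exemplar = torch.randn(1, 4, 8, 16, 16)            # toy video latents
exemplar_caption = torch.tensor([[101, MOTION_TOKEN_ID, 102]])
reg_clip = torch.randn(1, 4, 8, 16, 16)
reg_caption = torch.tensor([[101, 2, 102]])
for _ in range(2):
    step(exemplar, exemplar_caption)
    step(reg_clip, reg_caption)
```

The alternation between exemplar and regularization batches is the key design choice suggested by the abstract: it plays the role of the paper's video-based regularization, keeping the pretrained motion priors available so the learned token can later be composed with other subjects and motions.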