Customizing Motion in Text-to-Video Diffusion Models
December 7, 2023
Authors: Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba, Richard Zhang, Bryan Russell
cs.AI
Abstract
We introduce an approach for augmenting text-to-video generation models with
customized motions, extending their capabilities beyond the motions depicted in
the original training data. By leveraging a few video samples demonstrating
specific movements as input, our method learns and generalizes the input motion
patterns for diverse, text-specified scenarios. Our contributions are
threefold. First, to achieve our results, we finetune an existing text-to-video
model to learn a novel mapping from the depicted motion in the input
examples to a new unique token. To avoid overfitting to the new custom motion,
we introduce an approach for regularization over videos. Second, by leveraging
the motion priors in a pretrained model, our method can produce novel videos
featuring multiple people doing the custom motion, and can invoke the motion in
combination with other motions. Furthermore, our approach extends to the
multimodal customization of motion and appearance of individualized subjects,
enabling the generation of videos featuring unique characters and distinct
motions. Third, to validate our method, we introduce an approach for
quantitatively evaluating the learned custom motion and perform a systematic
ablation study. We show that our method significantly outperforms prior
appearance-based customization approaches when extended to the motion
customization task.
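
The abstract gives no implementation details, but the described recipe (a new unique token bound to the demonstrated motion, plus a denoising loss over regularization videos to avoid overfitting) can be loosely illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration only: the model signature, the batch layout, the `<motion>` token caption, and the `reg_weight` value are hypothetical, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def denoising_loss(model, latents, text_emb, alphas_cumprod):
    """Standard epsilon-prediction diffusion loss on video latents of
    shape (batch, channels, frames, height, width)."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, alphas_cumprod.shape[0], (latents.shape[0],),
                      device=latents.device)
    abar = alphas_cumprod[t].view(-1, 1, 1, 1, 1)
    noisy = abar.sqrt() * latents + (1.0 - abar).sqrt() * noise
    return F.mse_loss(model(noisy, t, text_emb), noise)

def training_step(model, custom_batch, reg_batch, alphas_cumprod,
                  reg_weight=1.0):
    # custom_batch: the few exemplar videos, captioned with the new unique
    # token, e.g. "a person doing the <motion> movement" (hypothetical).
    loss_custom = denoising_loss(model, custom_batch["latents"],
                                 custom_batch["text_emb"], alphas_cumprod)
    # reg_batch: related videos with generic captions; this second term is
    # the "regularization over videos" the abstract mentions, keeping the
    # pretrained motion priors intact so the new token does not dominate.
    loss_reg = denoising_loss(model, reg_batch["latents"],
                              reg_batch["text_emb"], alphas_cumprod)
    return loss_custom + reg_weight * loss_reg
```

The two-term structure mirrors the abstract's stated goal: without the regularization loss, fine-tuning on a handful of clips tends to collapse the model so that many prompts reproduce the exemplar motion regardless of the text.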