Subject-driven Video Generation via Disentangled Identity and Motion
April 23, 2025
Authors: Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo
cs.AI
Abstract
We propose training a subject-driven customized video generation model by
decoupling subject-specific learning from temporal dynamics, enabling zero-shot
customization without additional tuning. Traditional tuning-free methods for
video customization often rely on large, annotated video datasets, which are
computationally expensive and require extensive annotation. In contrast, we
train the video customization model directly on an image customization dataset,
factorizing video customization into two parts: (1) identity injection through
the image customization dataset and (2) temporal modeling preservation with a
small set of unannotated videos through image-to-video training. Additionally,
we employ random image token dropping with randomized image initialization
during image-to-video fine-tuning to mitigate the copy-and-paste issue. To
further enhance learning, we introduce stochastic switching during joint
optimization of subject-specific and temporal features, mitigating catastrophic
forgetting. Our method achieves strong subject consistency and scalability,
outperforming existing video customization models in zero-shot settings and
demonstrating the effectiveness of our framework.
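The copy-and-paste mitigation can be pictured with a minimal sketch. The snippet below is a hypothetical PyTorch illustration, not the authors' implementation: `drop_image_tokens` randomly removes a fraction of the reference-image conditioning tokens, and `randomize_first_frame` perturbs the conditioning latent so the model cannot trivially replicate the reference image in every generated frame. Function names, tensor shapes, and the drop/noise rates are assumptions.

```python
import torch

def drop_image_tokens(image_tokens: torch.Tensor, drop_prob: float = 0.3) -> torch.Tensor:
    # image_tokens: (batch, num_tokens, dim) conditioning tokens from the reference image.
    # Zero out a random subset of tokens so the generator cannot copy the reference verbatim.
    keep = torch.rand(image_tokens.shape[:2], device=image_tokens.device) > drop_prob
    return image_tokens * keep.unsqueeze(-1).to(image_tokens.dtype)

def randomize_first_frame(latent: torch.Tensor, noise_scale: float = 0.1) -> torch.Tensor:
    # latent: latent representation of the conditioning image.
    # Add Gaussian noise so image-to-video training does not rely on an exact pixel match.
    return latent + noise_scale * torch.randn_like(latent)
```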
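The stochastic switching during joint optimization could look like the sketch below, assuming a model that exposes hypothetical `identity_loss` and `i2v_loss` methods; the switching probability and batch sources are illustrative choices, not details from the paper.

```python
import random

def joint_training_step(model, optimizer, image_custom_batch, video_batch, p_identity=0.5):
    # Stochastically switch between the two objectives so that neither the
    # identity-injection pathway nor the temporal-modeling pathway overwrites
    # the other (mitigating catastrophic forgetting).
    optimizer.zero_grad()
    if random.random() < p_identity:
        # Identity injection: batch drawn from the image customization dataset.
        loss = model.identity_loss(image_custom_batch)  # hypothetical method name
    else:
        # Temporal preservation: image-to-video objective on unannotated videos.
        loss = model.i2v_loss(video_batch)  # hypothetical method name
    loss.backward()
    optimizer.step()
    return loss.item()
```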