CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
January 18, 2024
Authors: Zhao Wang, Aoxue Li, Enze Xie, Lingting Zhu, Yong Guo, Qi Dou, Zhenguo Li
cs.AI
Abstract
Customized text-to-video generation aims to generate high-quality videos
guided by text prompts and subject references. Current approaches designed for
a single subject struggle to handle multiple subjects, which is a more
challenging and practical scenario. In this work, we aim to advance
multi-subject guided text-to-video customization. We propose CustomVideo, a
novel framework that can generate identity-preserving videos under the guidance
of multiple subjects. Specifically, we first encourage the co-occurrence of
multiple subjects by composing them in a single image. Then, building on a
basic text-to-video diffusion model, we design a simple yet effective attention
control strategy to disentangle the different subjects in the latent space of
the diffusion model. Moreover, to help the model focus on each specific object
region, we segment the objects from the given reference images and provide the
corresponding object masks for attention learning. We also collect a
multi-subject text-to-video generation dataset as a comprehensive benchmark,
with 69 individual subjects and 57 meaningful pairs. Extensive qualitative,
quantitative, and user-study results demonstrate the superiority of our method
over previous state-of-the-art approaches.
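The abstract describes guiding attention with per-subject object masks so that each subject's cross-attention stays inside its own region. The paper's exact loss is not given here; the following is a hypothetical minimal sketch of one common form of mask-guided attention regularization: penalize the attention mass of a subject token that falls outside that subject's mask. All names and shapes below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def masked_attention_loss(attn_map, object_mask):
    """Encourage a subject token's cross-attention to lie inside its mask.

    attn_map:    (H, W) non-negative cross-attention scores for one subject token.
    object_mask: (H, W) binary mask of that subject (from the reference image).
    Returns the fraction of attention mass leaking outside the mask (to minimize).
    Note: this is an illustrative sketch, not the paper's actual objective.
    """
    attn = attn_map / (attn_map.sum() + 1e-8)   # normalize to a distribution
    outside = attn * (1.0 - object_mask)        # mass outside the subject region
    return float(outside.sum())

# Toy example: a 4x4 latent grid with the subject in the top-left 2x2 block.
mask = np.zeros((4, 4)); mask[:2, :2] = 1.0
focused = np.zeros((4, 4)); focused[:2, :2] = 1.0   # attention inside the mask
diffuse = np.ones((4, 4))                            # attention spread everywhere

print(masked_attention_loss(focused, mask))  # → 0.0 (no leakage)
print(masked_attention_loss(diffuse, mask))  # → 0.75 (12 of 16 cells outside)
```

In a multi-subject setting, one such term per subject token (each with its own mask) would push the attention maps of different subjects apart, which matches the abstract's goal of disentangling subjects in the diffusion latent space.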