VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation
May 18, 2023
Authors: Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, Jiaying Liu
cs.AI
Abstract
We present VideoFactory, an innovative framework for generating high-quality
open-domain videos. VideoFactory excels in producing high-definition
(1376x768), widescreen (16:9) videos without watermarks, creating an engaging
user experience. Generating videos guided by text instructions poses
significant challenges, such as modeling the complex relationship between space
and time, and the lack of large-scale text-video paired data. Previous
approaches extend pretrained text-to-image generation models by adding temporal
1D convolution/attention modules for video generation. However, these
approaches overlook the importance of jointly modeling space and time,
inevitably leading to temporal distortions and misalignment between texts and
videos. In this paper, we propose a novel approach that strengthens the
interaction between spatial and temporal perceptions. In particular, we utilize
a swapped cross-attention mechanism in 3D windows that alternates the "query"
role between spatial and temporal blocks, enabling them to reinforce each
other. To fully unlock model capabilities for high-quality video
generation, we curate a large-scale video dataset called HD-VG-130M. This
dataset comprises 130 million open-domain text-video pairs, ensuring
high-definition, widescreen, and watermark-free characteristics. Objective metrics
and user studies demonstrate the superiority of our approach in terms of
per-frame quality, temporal correlation, and text-video alignment by clear
margins.
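
To make the swapped cross-attention idea concrete, here is a minimal PyTorch sketch, not the authors' implementation: a spatial and a temporal token stream take turns acting as the cross-attention query within a window. The class name, tensor shapes, and residual wiring are illustrative assumptions; the paper's actual 3D-window partitioning and block layout are not reproduced here.

```python
import torch
import torch.nn as nn

class SwapCrossAttention(nn.Module):
    """Minimal sketch of swapped cross-attention between a spatial and a
    temporal token stream inside one 3D window. Illustrative only; names,
    residual wiring, and window handling are assumptions."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_stream: torch.Tensor, kv_stream: torch.Tensor) -> torch.Tensor:
        # query_stream, kv_stream: (batch, tokens, dim) features flattened
        # from a 3D window. The "query" side attends to the other stream.
        out, _ = self.attn(query_stream, kv_stream, kv_stream)
        return query_stream + out  # residual connection (assumed)

# Alternating the "query" role across successive blocks:
spatial = torch.randn(2, 64, 128)    # e.g. 8x8 spatial tokens per window
temporal = torch.randn(2, 16, 128)   # e.g. 16 temporal tokens per window
block = SwapCrossAttention(dim=128)

spatial = block(spatial, temporal)   # spatial tokens query temporal context
temporal = block(temporal, spatial)  # then temporal tokens query spatial context
```

In a full model, such blocks would presumably interleave with the diffusion backbone's spatial and temporal layers, with attention restricted to local 3D windows rather than computed over all tokens as in this sketch.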