Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data
August 19, 2024
Authors: Tao Yang, Yangming Shi, Yunwen Huang, Feng Chen, Yin Zheng, Lei Zhang
cs.AI
Abstract
Text-to-video (T2V) generation has gained significant attention due to its
wide applications in video generation, editing, enhancement, translation, etc.
However, high-quality (HQ) video synthesis is extremely challenging because of
the diverse and complex motions that exist in the real world. Most existing
works attempt to address this problem by collecting large-scale HQ videos,
which are inaccessible to the community. In this work, we show that publicly
available limited and low-quality (LQ) data are sufficient to train an HQ video
generator without recaptioning or finetuning. We factorize the whole T2V
generation process into two steps: generating an image conditioned on a highly
descriptive caption, and synthesizing the video conditioned on the generated
image and a concise caption of motion details. Specifically, we present
Factorized-Dreamer, a factorized spatiotemporal framework with several
critical designs for T2V generation, including an adapter to combine text and
image embeddings, a pixel-aware cross attention module to capture pixel-level
image information, a T5 text encoder to better understand motion descriptions,
and a PredictNet to supervise optical flows. We further present a noise
schedule, which plays a key role in ensuring the quality and stability of video
generation. Our model lowers the requirements for detailed captions and HQ
videos, and can be directly trained on limited LQ datasets with noisy and brief
captions such as WebVid-10M, largely alleviating the cost of collecting
large-scale HQ video-text pairs. Extensive experiments in a variety of T2V and
image-to-video generation tasks demonstrate the effectiveness of our proposed
Factorized-Dreamer. Our source code is available at
https://github.com/yangxy/Factorized-Dreamer/.
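To make the two-step factorization concrete, below is a minimal sketch of the pipeline described in the abstract: an image is first generated from a highly descriptive caption, then animated using only a concise motion caption. `TextToImageModel`/`ImageToVideoModel` style callables and all signatures here are hypothetical stand-ins, not the repository's actual API; see https://github.com/yangxy/Factorized-Dreamer/ for the real implementation.

```python
# Minimal sketch of the factorized T2V pipeline (hypothetical API,
# not the repository's actual interface).
import torch

class FactorizedT2V:
    def __init__(self, t2i_model, i2v_model):
        self.t2i = t2i_model  # step 1: image from a highly descriptive caption
        self.i2v = i2v_model  # step 2: video from that image + a concise motion caption

    @torch.no_grad()
    def generate(self, detailed_caption: str, motion_caption: str, num_frames: int = 16):
        # Step 1: synthesize a keyframe that carries the appearance details,
        # so the video stage does not need a long, descriptive caption.
        image = self.t2i(prompt=detailed_caption)  # (C, H, W)
        # Step 2: animate the keyframe; a brief, noisy caption (e.g. from
        # WebVid-10M) only has to convey the motion, not the appearance.
        return self.i2v(image=image, prompt=motion_caption,
                        num_frames=num_frames)  # (T, C, H, W)
```

This split is why LQ data with short captions suffices: appearance quality is delegated to the image generator, and the video stage only learns motion.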
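The pixel-aware cross-attention module can be pictured as standard cross-attention whose keys and values come from per-pixel features of the conditioning image, so the video latents can pull fine appearance details from it. The sketch below is an assumption about the general mechanism (module name, shapes, and residual wiring are ours), not the paper's exact design.

```python
import torch
import torch.nn as nn

class PixelAwareCrossAttentionSketch(nn.Module):
    """Rough sketch: video latent tokens attend to pixel-level image
    features of the conditioning image. Projections and shapes are
    assumptions, not the paper's exact module."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor,
                image_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, T*H*W, dim) latent tokens of the video being denoised
        # image_tokens: (B, H*W, dim) per-pixel features of the conditioning image
        attended, _ = self.attn(self.norm(video_tokens), image_tokens, image_tokens)
        return video_tokens + attended  # residual keeps the base path intact
```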