LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

September 26, 2023
Authors: Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu
cs.AI

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is highly desirable yet challenging to simultaneously a) synthesize visually realistic and temporally coherent videos and b) preserve the strong creative generation ability of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) we show that incorporating simple temporal self-attention, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data; 2) we validate that joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
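
The first insight, temporal self-attention combined with rotary positional encoding, can be illustrated with a small NumPy sketch. This is not LaVie's implementation: the function names, the single-head layout, the (frames, positions, dim) tensor shape, and the projection matrices `w_q`, `w_k`, `w_v` are all assumptions made for illustration. The sketch only shows the general idea that each spatial location attends across frames, with queries and keys rotated according to their frame index.

```python
import numpy as np


def rotary_encode(x, base=10000.0):
    """Apply rotary positional encoding along the frame axis.

    x has shape (frames, dim) with an even dim. Each feature pair
    (2i, 2i+1) of frame t is rotated by the angle t * base**(-2i/dim).
    """
    x = np.asarray(x, dtype=float)
    frames, dim = x.shape
    assert dim % 2 == 0, "rotary encoding needs an even feature dimension"
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)          # (dim/2,)
    angles = np.arange(frames)[:, None] * inv_freq[None, :]   # (frames, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out


def temporal_self_attention(video, w_q, w_k, w_v):
    """Single-head attention across frames, run separately per spatial location.

    video: (frames, positions, dim) array of latent features.
    w_q, w_k, w_v: (dim, dim) projection matrices (hypothetical parameters).
    """
    video = np.asarray(video, dtype=float)
    frames, positions, dim = video.shape
    out = np.empty_like(video)
    for p in range(positions):
        tokens = video[:, p, :]                       # (frames, dim)
        q = rotary_encode(tokens @ w_q)               # position-aware queries
        k = rotary_encode(tokens @ w_k)               # position-aware keys
        v = tokens @ w_v                              # values are not rotated
        scores = q @ k.T / np.sqrt(dim)               # (frames, frames)
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, p, :] = weights @ v
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(16, 4, 8))             # 16 frames, 4 locations, dim 8
    w = [rng.normal(size=(8, 8)) for _ in range(3)]
    print(temporal_self_attention(latents, *w).shape)
```

Because the rotation angle grows linearly with the frame index, the dot product between a rotated query and a rotated key depends only on the difference of their frame indices. That relative-position property is the usual motivation for rotary encodings in attention over a sequence, here a sequence of frames.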