LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

September 26, 2023
Authors: Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu
cs.AI

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We reveal that the incorporation of simple temporal self-attentions, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) Additionally, we validate that the process of joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
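
The first insight above, temporal self-attention combined with rotary positional encoding, can be illustrated with a minimal NumPy sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: a single attention head, one spatial location attended across frames, the half-split channel pairing for the rotation, and the hypothetical names `rotary_embed` and `temporal_self_attention`. It is not LaVie's actual code.

```python
import numpy as np


def rotary_embed(x, base=10000.0):
    """Apply rotary positional encoding along the frame axis.

    x: array of shape (frames, dim) with an even dim.
    """
    frames, dim = x.shape
    half = dim // 2
    # One rotation frequency per channel pair (assumed half-split pairing).
    freqs = base ** (-np.arange(half) / half)
    angles = np.arange(frames)[:, None] * freqs[None, :]  # (frames, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by an angle proportional to the frame index.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def temporal_self_attention(tokens, wq, wk, wv):
    """Single-head self-attention across frames for one spatial location.

    tokens: (frames, dim); wq, wk, wv: (dim, dim) projection matrices.
    Returns the attended features and the attention weights.
    """
    # Positions enter only through the rotation of queries and keys.
    q = rotary_embed(tokens @ wq)
    k = rotary_embed(tokens @ wk)
    v = tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over the frame axis.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because queries and keys are rotated by angles proportional to their frame index, the attention score between two frames depends only on their relative offset, which is the property that makes rotary encoding a natural fit for modelling temporal correlations.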