LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

Abstract

This work aims to learn a high-quality text-to-video (T2V) generative model by using a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to (a) synthesize visually realistic and temporally coherent videos while (b) preserving the strong creative generation ability of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework built on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model. Our key insights are two-fold: 1) We show that simple temporal self-attention, coupled with rotary positional encoding, adequately captures the temporal correlations inherent in video data. 2) We validate that joint image-video fine-tuning plays a pivotal role in producing high-quality and creative outcomes. To enhance the performance of LaVie, we contribute a comprehensive and diverse video dataset named Vimeo25M, consisting of 25 million text-video pairs that prioritize quality, diversity, and aesthetic appeal. Extensive experiments demonstrate that LaVie achieves state-of-the-art performance both quantitatively and qualitatively. Furthermore, we showcase the versatility of pre-trained LaVie models in various long video generation and personalized video synthesis applications.
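
To make the first insight concrete, the following is a minimal NumPy sketch of temporal self-attention with rotary positional encoding: each spatial location attends only across frames, and queries and keys are rotated by a frame-dependent angle. It is an illustration under stated assumptions, not LaVie's implementation: the function names `rotary_embed` and `temporal_self_attention`, the single attention head, the projection matrices `wq`, `wk`, `wv`, and the frequency base of 10000 are all hypothetical choices made for this example.

```python
import numpy as np


def rotary_embed(x, base=10000.0):
    """Rotate feature pairs by an angle proportional to the frame index.

    x has shape (..., frames, dim) with an even dim. The base value is an
    assumption for this sketch, not a value taken from the paper.
    """
    frames, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)              # (half,)
    angles = np.arange(frames)[:, None] * freqs[None, :]   # (frames, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Standard 2D rotation applied to each (x1, x2) feature pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)


def temporal_self_attention(feats, wq, wk, wv):
    """Single-head attention over the frame axis at every spatial location.

    feats: (frames, height, width, channels); wq, wk, wv: (channels, channels).
    """
    f, h, w, c = feats.shape
    # Treat each pixel as its own sequence of f tokens.
    tokens = feats.transpose(1, 2, 0, 3).reshape(h * w, f, c)
    q = rotary_embed(tokens @ wq)
    k = rotary_embed(tokens @ wk)
    v = tokens @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)          # (h*w, f, f)
    scores = scores - scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                       # (h*w, f, c)
    return out.reshape(h, w, f, c).transpose(2, 0, 1, 3)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_feats = rng.standard_normal((16, 4, 4, 8))  # 16 frames of 4x4 latents
    wq, wk, wv = (rng.standard_normal((8, 8)) * 0.1 for _ in range(3))
    print(temporal_self_attention(video_feats, wq, wk, wv).shape)
```

Because the rotation angle grows linearly with the frame index, the query-key dot product depends only on the offset between two frames, which is the property that makes rotary encoding a natural fit for a temporal axis.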
