VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation
September 1, 2023
Authors: Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang
cs.AI
Abstract
In this paper, we present VideoGen, a text-to-video generation approach,
which can generate a high-definition video with high frame fidelity and strong
temporal consistency using reference-guided latent diffusion. We leverage an
off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to
generate an image with high content quality from the text prompt, as a
reference image to guide video generation. Then, we introduce an efficient
cascaded latent diffusion module conditioned on both the reference image and
the text prompt, for generating latent video representations, followed by a
flow-based temporal upsampling step to improve the temporal resolution.
Finally, we map latent video representations into a high-definition video
through an enhanced video decoder. During training, we use the first frame of a
ground-truth video as the reference image for training the cascaded latent
diffusion module. The main characteristics of our approach include: the reference
image generated by the text-to-image model improves the visual fidelity; using
it as the condition makes the diffusion model focus more on learning the video
dynamics; and the video decoder is trained on unlabeled video data, thus
benefiting from high-quality, easily available videos. VideoGen sets a new
state-of-the-art in text-to-video generation in terms of both qualitative and
quantitative evaluation.
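
The four stages described in the abstract can be read as a simple data flow: reference image, latent video, temporally upsampled latent video, decoded video. The sketch below only illustrates that flow with NumPy placeholders. Every function name, tensor shape, and constant is an assumption made for illustration and does not come from the paper. In particular, the temporal upsampling stub uses plain linear blending between neighbouring latent frames as a stand-in for the flow-based step, and the decoder stub is a nearest-neighbour upscale rather than a learned network.

```python
import numpy as np

# All names, shapes, and stub bodies are illustrative assumptions,
# not the paper's implementation.
LATENT_CHANNELS = 4   # assumed number of latent channels
LATENT_SIZE = 32      # assumed height/width of the latent grid
UPSCALE = 8           # assumed decoder upscaling factor


def generate_reference_image(prompt: str, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for an off-the-shelf text-to-image model such as Stable Diffusion."""
    size = LATENT_SIZE * UPSCALE
    return rng.random((size, size, 3), dtype=np.float32)


def cascaded_latent_diffusion(reference: np.ndarray, prompt: str,
                              num_frames: int, rng: np.random.Generator) -> np.ndarray:
    """Stand-in for the latent diffusion module conditioned on image and text.

    The stub ignores both conditions and returns noise of shape (T, C, H, W).
    """
    shape = (num_frames, LATENT_CHANNELS, LATENT_SIZE, LATENT_SIZE)
    return rng.standard_normal(shape, dtype=np.float32)


def temporal_upsample(latents: np.ndarray, factor: int) -> np.ndarray:
    """Placeholder for flow-based temporal upsampling.

    Linearly blends neighbouring latent frames instead of warping with flow.
    Returns (T - 1) * factor + 1 frames.
    """
    frames = []
    for t in range(latents.shape[0] - 1):
        for k in range(factor):
            w = k / factor
            frames.append((1.0 - w) * latents[t] + w * latents[t + 1])
    frames.append(latents[-1])
    return np.stack(frames)


def decode_video(latents: np.ndarray) -> np.ndarray:
    """Stand-in for the enhanced video decoder: nearest-neighbour upscaling."""
    up = latents.repeat(UPSCALE, axis=2).repeat(UPSCALE, axis=3)
    # Keep three channels and move them last: (T, H, W, 3).
    return up[:, :3].transpose(0, 2, 3, 1)


def videogen_sketch(prompt: str, num_frames: int = 8,
                    temporal_factor: int = 2, seed: int = 0) -> np.ndarray:
    """Chain the four placeholder stages in the order the abstract lists them."""
    rng = np.random.default_rng(seed)
    reference = generate_reference_image(prompt, rng)
    latents = cascaded_latent_diffusion(reference, prompt, num_frames, rng)
    latents = temporal_upsample(latents, temporal_factor)
    return decode_video(latents)


if __name__ == "__main__":
    video = videogen_sketch("a corgi running on a beach")
    print(video.shape)
```

During training, the abstract states that the first frame of a ground-truth video takes the place of the generated reference image; in this sketch that would correspond to passing that frame to `cascaded_latent_diffusion` instead of calling `generate_reference_image`.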