
VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

September 1, 2023
作者: Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang
cs.AI

Abstract

In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach include: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained over unlabeled video data, thus benefiting from high-quality, easily available videos. VideoGen sets a new state-of-the-art in text-to-video generation in terms of both qualitative and quantitative evaluation.
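The pipeline described above can be sketched as four stages with mock implementations. This is a minimal illustrative sketch only: the function names, tensor shapes, frame counts, and upsampling factors below are assumptions chosen to show the data flow, not the paper's actual models (the diffusion, flow-warping, and decoder stages are replaced by trivial stand-ins).

```python
# Hypothetical sketch of the VideoGen data flow; all shapes and stage
# implementations are illustrative assumptions, not the paper's models.
import numpy as np

def text_to_image(prompt: str) -> np.ndarray:
    # Stand-in for an off-the-shelf text-to-image model (e.g. Stable
    # Diffusion) producing one high-quality reference image (H, W, 3).
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((512, 512, 3))

def cascaded_latent_diffusion(ref_img, prompt, n_frames=16, latent_hw=64):
    # Conditioned on BOTH the reference image and the text prompt,
    # produce a low-frame-rate latent video (T, h, w, c); mocked here.
    rng = np.random.default_rng(0)
    return rng.random((n_frames, latent_hw, latent_hw, 4))

def flow_based_temporal_upsampling(latents):
    # Raise temporal resolution by inserting one interpolated latent
    # between consecutive frames (flow-based warping is approximated
    # here by simple averaging of neighbours).
    mids = 0.5 * (latents[:-1] + latents[1:])
    out = np.empty((latents.shape[0] * 2 - 1, *latents.shape[1:]))
    out[0::2] = latents
    out[1::2] = mids
    return out

def video_decoder(latents, scale=8):
    # Map latent video to high-definition RGB frames; the paper trains
    # this on unlabeled video, mocked here by nearest-neighbour
    # spatial upsampling of the first three latent channels.
    rgb = latents[..., :3]
    return rgb.repeat(scale, axis=1).repeat(scale, axis=2)

prompt = "a corgi surfing at sunset"
ref = text_to_image(prompt)
latents = cascaded_latent_diffusion(ref, prompt)      # (16, 64, 64, 4)
latents_hi = flow_based_temporal_upsampling(latents)  # (31, 64, 64, 4)
video = video_decoder(latents_hi)                     # (31, 512, 512, 3)
```

Note the division of labour the abstract emphasizes: conditioning on a strong reference image lets the diffusion stage spend its capacity on video dynamics rather than per-frame appearance, and the decoder stage needs no text labels at all.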