

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation

September 27, 2023
Authors: David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou
cs.AI

Abstract

Significant advancements have been achieved in the realm of large-scale pre-trained text-to-video Diffusion Models (VDMs). However, previous methods either rely solely on pixel-based VDMs, which come with high computational costs, or on latent-based VDMs, which often struggle with precise text-video alignment. In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation. Our model first uses pixel-based VDMs to produce a low-resolution video with strong text-video correlation. After that, we propose a novel expert translation method that employs latent-based VDMs to further upsample the low-resolution video to high resolution. Compared to latent VDMs, Show-1 can produce high-quality videos with precise text-video alignment; compared to pixel VDMs, Show-1 is much more efficient (GPU memory usage during inference is 15 GB vs. 72 GB). We also validate our model on standard video generation benchmarks. Our code and model weights are publicly available at https://github.com/showlab/Show-1.
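
The abstract describes a two-stage pipeline: a pixel-space VDM first generates a low-resolution video that tracks the prompt closely, and a latent-space VDM then acts as an "expert translator" that upsamples it to high resolution. The sketch below illustrates that control flow only; the class names, resolutions, and scale factor are hypothetical placeholders, not the actual Show-1 API (see the linked repository for the real implementation).

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All names and numbers here are illustrative stand-ins, not Show-1's API.
import torch


class PixelVDM:
    """Stand-in for a pixel-space text-to-video diffusion model."""

    def __call__(self, prompt: str, num_frames: int, size: tuple[int, int]) -> torch.Tensor:
        # A real model would iteratively denoise in pixel space conditioned
        # on the prompt; here we just return a tensor of shape (T, C, H, W).
        h, w = size
        return torch.rand(num_frames, 3, h, w)


class LatentUpsampler:
    """Stand-in for a latent-space VDM used as an expert translator that
    upsamples a low-resolution video to high resolution."""

    def __call__(self, prompt: str, video: torch.Tensor, scale: int) -> torch.Tensor:
        # A real model would encode to latents, denoise conditioned on the
        # low-res video and the prompt, then decode; here we just upsample.
        return torch.nn.functional.interpolate(
            video, scale_factor=scale, mode="bilinear", align_corners=False
        )


def show1_style_generate(prompt: str) -> torch.Tensor:
    pixel_vdm = PixelVDM()
    upsampler = LatentUpsampler()
    # Stage 1: cheap, low-resolution generation in pixel space, where
    # text-video alignment is strongest.
    low_res = pixel_vdm(prompt, num_frames=8, size=(40, 64))
    # Stage 2: latent-space super-resolution keeps inference memory low
    # (the paper reports 15 GB vs. 72 GB for an all-pixel pipeline).
    return upsampler(prompt, low_res, scale=8)


video = show1_style_generate("a panda playing guitar on a beach")
print(video.shape)  # torch.Size([8, 3, 320, 512])
```

The division of labor is the core idea: pixel-space denoising is reserved for the cheap low-resolution stage, where precise alignment matters most, while the expensive high-resolution work happens in a compressed latent space.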