Photorealistic Video Generation with Diffusion Models

December 11, 2023
Authors: Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
cs.AI

Abstract

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos at 512×896 resolution and 8 frames per second.
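The second design decision, window-restricted attention that alternates between per-frame spatial windows and windows spanning time, can be illustrated with a short sketch. The following is a minimal PyTorch illustration of that general idea, not the authors' released code: the module names, window sizes, and tensor shapes are assumptions made here for exposition.

```python
# Minimal sketch (assumed shapes, not the paper's implementation) of
# alternating window attention over video latents of shape (B, T, H, W, C).

import torch
import torch.nn as nn

def window_attention(x, window, attn):
    """Apply self-attention within non-overlapping (wt, wh, ww) windows.

    x: latent video tensor of shape (B, T, H, W, C)
    window: (wt, wh, ww) window size; must evenly divide (T, H, W)
    attn: an nn.MultiheadAttention module with batch_first=True
    """
    B, T, H, W, C = x.shape
    wt, wh, ww = window
    # Partition into windows: (B * num_windows, wt*wh*ww, C).
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
    out, _ = attn(x, x, x)  # tokens attend only within their own window
    # Merge windows back to (B, T, H, W, C).
    out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    return out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)

class AlternatingWindowBlock(nn.Module):
    """A spatial-window attention layer followed by a spatiotemporal one.

    The spatial window (1, H, W) covers a single frame, so image-only
    batches (T == 1) exercise the same layer; the spatiotemporal window
    spans time to model motion. Window sizes here are illustrative.
    """
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        B, T, H, W, C = x.shape
        x = x + window_attention(x, (1, H, W), self.spatial_attn)
        x = x + window_attention(x, (T, H // 2, W // 2), self.temporal_attn)
        return x

# Toy usage: 4 latent frames of 8x8 tokens with 64 channels.
block = AlternatingWindowBlock(dim=64, heads=4)
latents = torch.randn(2, 4, 8, 8, 64)
print(block(latents).shape)  # torch.Size([2, 4, 8, 8, 64])
```

The appeal of this layout, as the abstract notes, is efficiency: attention cost grows with the window size rather than with the full T×H×W token count, and the frame-local spatial windows let images and videos share the same layers in the unified latent space.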