
Photorealistic Video Generation with Diffusion Models

December 11, 2023
Authors: Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama
cs.AI

Abstract

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos at 512×896 resolution and 8 frames per second.
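To make the first design decision concrete, here is a minimal sketch (in PyTorch, not the authors' code) of the causal temporal convolution idea behind such an encoder: padding only the past side of the time axis means an output frame never depends on future frames, so a still image can be processed as a one-frame video in the same latent space. The module name and kernel sizes are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis (hypothetical sketch)."""
    def __init__(self, in_ch, out_ch, kernel_t=3, kernel_s=3):
        super().__init__()
        self.pad_t = kernel_t - 1  # pad the past only -> causal in time
        self.conv = nn.Conv3d(
            in_ch, out_ch,
            kernel_size=(kernel_t, kernel_s, kernel_s),
            padding=(0, kernel_s // 2, kernel_s // 2),  # 'same' padding in space
        )

    def forward(self, x):  # x: (batch, channels, time, height, width)
        # Zero-pad only the left of the time axis, so output frame t
        # depends solely on input frames <= t.
        x = F.pad(x, (0, 0, 0, 0, self.pad_t, 0))
        return self.conv(x)

# An image is simply a one-frame video under this scheme:
enc = CausalConv3d(3, 16)
video = torch.randn(1, 3, 8, 64, 64)  # 8 frames
image = torch.randn(1, 3, 1, 64, 64)  # single frame
print(enc(video).shape)  # torch.Size([1, 16, 8, 64, 64])
print(enc(image).shape)  # torch.Size([1, 16, 1, 64, 64])
```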
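The second design decision can be sketched in the same hedged spirit: restricting self-attention to non-overlapping windows over the (time, height, width) token grid, where a (1, H, W) window gives per-frame spatial attention and a (T, h, w) window gives spatiotemporal attention across frames. The window_attention helper and all shapes below are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

def window_attention(tokens, window, attn):
    # tokens: (B, T, H, W, C); window: (wt, wh, ww), each dividing (T, H, W).
    B, T, H, W, C = tokens.shape
    wt, wh, ww = window
    # Partition the token grid into non-overlapping windows.
    x = tokens.reshape(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)  # group window indices together
    x = x.reshape(-1, wt * wh * ww, C)     # one attention sequence per window
    x, _ = attn(x, x, x)                   # full self-attention inside each window
    # Undo the partitioning.
    x = x.reshape(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    x = x.permute(0, 1, 4, 2, 5, 3, 6, 7)
    return x.reshape(B, T, H, W, C)

tokens = torch.randn(2, 4, 8, 8, 32)  # (B, T, H, W, C)
attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
spatial = window_attention(tokens, (1, 8, 8), attn)         # within-frame window
spatiotemporal = window_attention(tokens, (4, 4, 4), attn)  # cross-frame window
print(spatial.shape, spatiotemporal.shape)  # both (2, 4, 8, 8, 32)
```

The efficiency argument the abstract makes follows from attention cost scaling with the square of the window size rather than with the full token grid.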