Photorealistic Video Generation with Diffusion Models

Abstract

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach rests on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, for text-to-video generation, we train a cascade of three models, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos at 512 × 896 resolution and 8 frames per second.
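
To make the efficiency argument for window attention concrete, the sketch below counts query-key pairs when self-attention is restricted to non-overlapping windows of a latent grid. It is only an illustration under stated assumptions: the latent shape, the window shapes, and the helper name `attention_pairs` are hypothetical and are not taken from the paper, which does not give these sizes in the abstract.

```python
from math import prod


def attention_pairs(latent_shape, window_shape):
    """Count query-key pairs when self-attention is limited to
    non-overlapping windows that tile the latent grid."""
    for size, win in zip(latent_shape, window_shape):
        if size % win != 0:
            raise ValueError("window must tile the latent evenly")
    tokens_per_window = prod(window_shape)
    num_windows = prod(latent_shape) // tokens_per_window
    # Each window attends only to itself: tokens_per_window^2 pairs per window.
    return num_windows * tokens_per_window ** 2


# Hypothetical latent grid: (frames, height, width). Not the paper's sizes.
latent = (4, 8, 8)

full = attention_pairs(latent, latent)               # global attention over all tokens
spatial = attention_pairs(latent, (1, 8, 8))         # one window per frame
spatiotemporal = attention_pairs(latent, (4, 4, 4))  # local block spanning all frames

print(full, spatial, spatiotemporal)
```

With these illustrative sizes, both windowed variants compute a quarter of the pairs that global attention would, which is the kind of saving that motivates restricting attention to local windows.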
