Photorealistic Video Generation with Diffusion Models

Abstract

We present W.A.L.T, a transformer-based approach to photorealistic video generation via diffusion modeling. Our approach rests on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos at 512 × 896 resolution and 8 frames per second.
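As a rough illustration of why restricting attention to windows helps with memory, the sketch below counts query-key pairs for full attention and for two windowed layouts over a latent token grid. The grid size, the window shapes, and the helper name `attention_pairs` are all assumptions chosen for illustration; they are not taken from the paper.

```python
def attention_pairs(latent_shape, window):
    """Count query-key pairs when attention is limited to non-overlapping windows."""
    t, h, w = latent_shape
    wt, wh, ww = window
    if t % wt or h % wh or w % ww:
        raise ValueError("window must evenly divide the latent grid")
    num_windows = (t // wt) * (h // wh) * (w // ww)
    tokens_per_window = wt * wh * ww
    # Each window attends only to itself, so the cost is quadratic per window.
    return num_windows * tokens_per_window ** 2


# Hypothetical (frames, height, width) grid of latent tokens.
latent_shape = (4, 16, 16)

full = attention_pairs(latent_shape, latent_shape)         # one global window
spatial = attention_pairs(latent_shape, (1, 16, 16))       # one window per frame
spatiotemporal = attention_pairs(latent_shape, (4, 8, 8))  # windows spanning all frames

print(full, spatial, spatiotemporal)  # 1048576 262144 262144
```

With these assumed sizes, both windowed layouts compute a quarter of the attention scores that full attention over all 1,024 tokens would need.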
