MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

Abstract

Video generative models have become increasingly powerful, but long-range consistency remains difficult to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video with a coarse-to-fine rollout in a multi-scale token space. Our approach is simple. First, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution down to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Second, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach on a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts than existing baselines.
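
To make the compute argument concrete, the following minimal sketch counts context tokens when older frames are kept only at coarser levels of the hierarchy. It is an illustration under assumed numbers, not the model's actual configuration: the per-level token counts (256, 64, 4), the distance horizons, and the helper names `level_for_distance` and `context_length` are all hypothetical, and a distance-based schedule is only one possible way to control the level of detail used as context.

```python
from typing import Sequence


def level_for_distance(distance: int, horizons: Sequence[int]) -> int:
    """Pick the level of detail for a context frame.

    Levels are indexed from finest (0) to coarsest. horizons[i] is the
    largest distance, in frames, at which level i is still used. Frames
    farther away than every horizon fall back to the coarsest level.
    """
    for level, horizon in enumerate(horizons):
        if distance <= horizon:
            return level
    return len(horizons)


def context_length(
    num_context_frames: int,
    tokens_per_level: Sequence[int],
    horizons: Sequence[int],
) -> int:
    """Total number of context tokens across all past frames."""
    if len(tokens_per_level) != len(horizons) + 1:
        raise ValueError("need exactly one more level than horizons")
    total = 0
    for distance in range(1, num_context_frames + 1):
        total += tokens_per_level[level_for_distance(distance, horizons)]
    return total


# Hypothetical values: tokens per frame at each level, finest first.
TOKENS_PER_LEVEL = [256, 64, 4]
# Hypothetical schedule: finest level for the last 4 frames, middle level
# up to 16 frames back, coarsest level for everything older.
HORIZONS = [4, 16]

hierarchical = context_length(1000, TOKENS_PER_LEVEL, HORIZONS)
full_resolution = 1000 * TOKENS_PER_LEVEL[0]

print(hierarchical)      # 5728
print(full_resolution)   # 256000
```

Under these assumed values, 1,000 context frames cost 5,728 tokens instead of 256,000 at full latent resolution, a sequence roughly 45 times shorter, while every past frame still contributes its coarse layout and semantic tokens.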