MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Abstract

We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.
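As a rough illustration of how one model can cover several tasks purely through the choice of masked frames, the sketch below builds a per-frame mask for each of the three tasks named in the abstract. It is not MarDini's code: the function name `frame_mask`, the task labels, and the conventions chosen here (interpolation keeps the first and last frames; expansion keeps the first half and masks the second half) are assumptions made for the example.

```python
from typing import List


def frame_mask(num_frames: int, task: str) -> List[bool]:
    """Return one flag per frame: True = masked (to be generated),
    False = given as a reference frame.

    Hypothetical helper for illustration; the task names and the exact
    masking conventions are assumptions, not taken from the paper.
    """
    if num_frames < 3:
        raise ValueError("need at least 3 frames")

    if task == "interpolation":
        # Keep the first and last frames, mask everything in between.
        return [0 < i < num_frames - 1 for i in range(num_frames)]

    if task == "image_to_video":
        # Keep only the first frame, mask from the second frame onward.
        return [i >= 1 for i in range(num_frames)]

    if task == "expansion":
        # Assumption: keep the first half, mask the second half.
        half = num_frames // 2
        return [i >= half for i in range(num_frames)]

    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    for name in ("interpolation", "image_to_video", "expansion"):
        print(name, frame_mask(6, name))
```

Under this reading, the planning model would see the unmasked frames as conditioning and produce a planning signal for each position flagged `True`; any other mask pattern would be handled the same way.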