AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Abstract

We present AAD-1, an Asymmetric Adversarial Distillation framework for one-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from training instability and motion collapse, which results in static videos. AAD-1 addresses these challenges through two key designs, one in architecture and one in training strategy. Our key architectural insight is to break the symmetry between the generator and the discriminator. The generator remains causal to preserve its autoregressive sampling capability, while the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to detect the global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator; this warm-up phase brings the student distribution closer to the teacher distribution before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.
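
The generator/discriminator asymmetry and the phased schedule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes frame-level (block) causality for the generator, which the abstract does not specify beyond "causal", and all function names and the `warmup_steps` parameter are hypothetical.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Generator-side attention mask (assumed block-causal over frames).

    Entry [q][k] is True when query token q may attend to key token k,
    i.e. when k lies in the same frame as q or in an earlier frame.
    """
    n = num_frames * tokens_per_frame
    return [
        [(k // tokens_per_frame) <= (q // tokens_per_frame) for k in range(n)]
        for q in range(n)
    ]


def bidirectional_mask(num_frames, tokens_per_frame):
    """Discriminator-side attention mask: every token sees the full
    spatiotemporal context, so later frames can inform the realism score."""
    n = num_frames * tokens_per_frame
    return [[True] * n for _ in range(n)]


def training_phase(step, warmup_steps):
    """Hypothetical two-phase schedule: distribution matching as warm-up,
    then adversarial distillation once warmup_steps have elapsed."""
    if step < warmup_steps:
        return "distribution_matching"
    return "adversarial_distillation"


if __name__ == "__main__":
    gen_mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
    disc_mask = bidirectional_mask(num_frames=3, tokens_per_frame=2)
    # Token 0 (frame 0) cannot see token 2 (frame 1) in the generator,
    # but the discriminator sees every position.
    print(gen_mask[0][2], disc_mask[0][2])
```

Because the discriminator mask has no causal restriction, a single score computed over its output can depend on relationships between distant frames, which is the property the abstract credits with detecting long-range drift.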