Streaming Autoregressive Video Generation via Diagonal Distillation

Abstract

Large pretrained diffusion models have significantly enhanced the quality of generated videos, and yet their use in real-time streaming remains limited. Autoregressive models offer a natural framework for sequential frame synthesis but require heavy computation to achieve high fidelity. Diffusion distillation can compress these models into efficient few-step variants, but existing video distillation approaches largely adapt image-specific methods that neglect temporal dependencies. These techniques often excel in image generation but underperform in video synthesis, exhibiting reduced motion coherence, error accumulation over long sequences, and a latency-quality trade-off. We identify two factors that result in these limitations: insufficient utilization of temporal context during step reduction and implicit prediction of subsequent noise levels in next-chunk prediction (i.e., exposure bias). To address these issues, we propose Diagonal Distillation, which operates orthogonally to existing approaches and better exploits temporal information across both video chunks and denoising steps. Central to our approach is an asymmetric generation strategy: more steps early, fewer steps later. This design allows later chunks to inherit rich appearance information from thoroughly processed early chunks, while using partially denoised chunks as conditional inputs for subsequent synthesis. By aligning the implicit prediction of subsequent noise levels during chunk generation with the actual inference conditions, our approach mitigates error propagation and reduces oversaturation in long-range sequences. We further incorporate implicit optical flow modeling to preserve motion quality under strict step constraints. Our method generates a 5-second video in 2.61 seconds (up to 31 FPS), achieving a 277.3x speedup over the undistilled model.
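The asymmetric strategy described above (more denoising steps for early chunks, fewer for later ones) can be illustrated with a toy step schedule. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the schedule, so the linear decay, the function names `diagonal_step_schedule` and `total_denoiser_calls`, and the values 7 chunks, 5 maximum steps, and 2 minimum steps are all hypothetical choices made for illustration.

```python
from typing import List


def diagonal_step_schedule(num_chunks: int, max_steps: int, min_steps: int) -> List[int]:
    """Return one denoising-step count per chunk: more steps early, fewer later.

    Hypothetical schedule: the count drops by one per chunk and is clamped
    at min_steps. The actual schedule is not given in the abstract.
    """
    if num_chunks < 1 or min_steps < 1 or max_steps < min_steps:
        raise ValueError("need num_chunks >= 1 and 1 <= min_steps <= max_steps")
    return [max(max_steps - i, min_steps) for i in range(num_chunks)]


def total_denoiser_calls(schedule: List[int]) -> int:
    """Total number of denoiser evaluations implied by a per-chunk schedule."""
    return sum(schedule)


# Illustrative values only (assumptions, not figures from the paper).
schedule = diagonal_step_schedule(num_chunks=7, max_steps=5, min_steps=2)
uniform = [5] * 7  # baseline: every chunk gets the maximum step count

print("diagonal schedule:", schedule, "calls:", total_denoiser_calls(schedule))
print("uniform schedule: ", uniform, "calls:", total_denoiser_calls(uniform))
```

With these illustrative values, the early chunks receive the full step budget while later chunks fall back to the minimum, so the total number of denoiser evaluations is lower than under a uniform schedule. The sketch covers only the step allocation; conditioning later chunks on partially denoised earlier ones and the implicit optical flow modeling are outside its scope.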
