Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

Abstract

We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/
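
The rollout idea described above can be illustrated with a minimal toy sketch. It is not the authors' implementation: frames are plain floats, `toy_model` is a hypothetical stand-in for a few-step denoiser, and a bounded `deque` stands in for the rolling KV cache. The sketch only shows that each frame is conditioned on previously self-generated frames and that the cache stays at a fixed size; the video-level loss and the stochastic gradient truncation are not modeled here.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence

# Hypothetical stand-in for a few-step denoiser: (cached context, noise) -> frame.
FrameModel = Callable[[Sequence[float], float], float]


def toy_model(context: Sequence[float], noise: float) -> float:
    """Toy 'denoiser': mean of the cached context plus a small share of the noise."""
    base = sum(context) / len(context) if context else 0.0
    return base + 0.1 * noise


def self_rollout(model: FrameModel, noises: Sequence[float], window: int) -> List[float]:
    """Generate frames one at a time, conditioning only on self-generated frames.

    The deque plays the role of a rolling cache (an assumption of this sketch):
    once `window` entries are stored, appending a new one evicts the oldest,
    so memory stays bounded no matter how long the rollout runs.
    """
    cache: Deque[float] = deque(maxlen=window)
    frames: List[float] = []
    for noise in noises:
        frame = model(tuple(cache), noise)  # context = the model's own earlier outputs
        cache.append(frame)                 # oldest entry is dropped automatically
        frames.append(frame)
    return frames


frames = self_rollout(toy_model, noises=[1.0, 0.0, 0.0, 0.0], window=2)
```

In a real model the cache would hold per-layer key and value tensors rather than floats, and the context would never include ground-truth frames during training, which is the point of contrast with teacher-forced training.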
