Causal-rCM: An Open Recipe for Unified Teacher-Forcing and Self-Forcing in Autoregressive Diffusion Distillation for Streaming Video Generation and Interactive World Models

Abstract

Autoregressive video diffusion with causal diffusion transformers has emerged as a major paradigm for real-time streaming video generation and action-conditioned interactive world models. In this work, we extend rCM, an advanced diffusion distillation framework, to autoregressive video diffusion. The core philosophy of rCM is that forward and reverse divergences are complementary in diffusion distillation, with consistency models (CMs) representing the forward divergence and distribution matching distillation (DMD) representing the reverse divergence. This philosophy carries over naturally to the autoregressive setting, where teacher-forcing (TF) provides an offline, forward-divergence causal training paradigm, while self-forcing (SF) corresponds to an on-policy, reverse-divergence refinement. Our contributions are: (1) through extensive experiments, we show that teacher-forcing CM is currently the best complement to self-forcing DMD as an initialization strategy; (2) we present the first implementation of teacher-forcing-based continuous-time CMs (e.g., sCM/MeanFlow) for autoregressive video diffusion, enabled by our custom-mask FlashAttention-2 JVP kernel, achieving 10× faster convergence than discrete-time CMs (dCMs); (3) we introduce Causal-rCM, a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training; (4) we achieve state-of-the-art streaming video generation performance in both frame-wise and chunk-wise settings, using only synthetic data for training. Notably, our distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps. We further apply Causal-rCM to Cosmos 3, an advanced omnimodal world foundation model for physical AI with action-conditioned generation capability, enabling an interactive world model.