Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Abstract

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods achieve strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1 to 2 sampling steps. In this regime, we identify the initialization of the few-step AR student as the key bottleneck: existing strategies are either misaligned with the target, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains its supervision from a single online teacher ODE step between adjacent timesteps, which avoids precomputing and storing full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. In the frame-wise 2-step setting, Causal Forcing++ surpasses the state-of-the-art 4-step chunk-wise Causal Forcing by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by approximately 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project pages: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM .
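
The contrast between causal ODE distillation and causal CD can be made concrete with a small sketch. The notation below is an assumption introduced for illustration and is not taken from the source: it follows the standard consistency distillation objective, with conditioning on preceding frames added, and uses a single Euler step as one possible choice of teacher ODE step.

```latex
% Assumed notation (hypothetical, for illustration only):
%   x^{<i}          clean context frames preceding frame i
%   x^i_t           frame i at noise level t, with t = 0 the clean frame
%   v_\phi          frozen AR teacher velocity field (drift of the PF-ODE)
%   f_\theta        student flow map; f_{\theta^-} is a stop-gradient (or EMA) copy
%   t_n < t_{n+1}   adjacent timesteps on a fixed grid
%   d(.,.)          a distance, e.g. squared L2

% Causal ODE distillation: regress onto the endpoint \hat{x}^i_0 of a full
% teacher PF-ODE trajectory through x^i_{t_n}, which must be solved in advance.
\begin{align*}
\mathcal{L}_{\mathrm{ODE}}(\theta)
  &= \mathbb{E}\Big[ d\big( f_\theta(x^i_{t_n}, t_n \mid x^{<i}),\; \hat{x}^i_0 \big) \Big]
\end{align*}

% Causal CD: take one online teacher step from t_{n+1} to t_n, then ask the
% student to give the same output at both ends of that step.
\begin{align*}
\hat{x}^i_{t_n}
  &= x^i_{t_{n+1}} + (t_n - t_{n+1})\, v_\phi\big(x^i_{t_{n+1}}, t_{n+1} \mid x^{<i}\big) \\
\mathcal{L}_{\mathrm{CD}}(\theta)
  &= \mathbb{E}\Big[ d\big( f_\theta(x^i_{t_{n+1}}, t_{n+1} \mid x^{<i}),\;
       f_{\theta^-}(\hat{x}^i_{t_n}, t_n \mid x^{<i}) \big) \Big],
  \qquad f_\theta(x, 0 \mid x^{<i}) = x
\end{align*}
```

Under these assumptions, the boundary condition at t = 0 together with consistency across every adjacent pair of timesteps pushes the student toward the same map from a noisy frame to the teacher's ODE endpoint that the first objective targets directly, while each training step costs only one teacher evaluation and needs no stored trajectory.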