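Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models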

Abstract

Autoregressive video diffusion with causal diffusion transformers has emerged as a major paradigm for real-time streaming video generation and action-conditioned interactive world models. In this work, we extend rCM, an advanced diffusion distillation framework, to autoregressive video diffusion. The core philosophy of rCM is that forward and reverse divergences are complementary; in diffusion distillation they are represented by consistency models (CMs) and distribution matching distillation (DMD), respectively. This philosophy carries over naturally to the autoregressive setting, where teacher-forcing (TF) provides an offline, forward-divergence causal training paradigm, while self-forcing (SF) corresponds to an on-policy, reverse-divergence refinement. Our contributions are: (1) through extensive experiments, we show that teacher-forcing CM is currently the best complement to self-forcing DMD as an initialization strategy; (2) we present the first implementation of teacher-forcing-based continuous-time CMs (e.g., sCM/MeanFlow) for autoregressive video diffusion, enabled by our custom-mask FlashAttention-2 JVP kernel, achieving 10× faster convergence than discrete-time CMs (dCMs); (3) we introduce Causal-rCM, a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training; (4) we achieve state-of-the-art streaming video generation performance in both frame-wise and chunk-wise settings, using only synthetic data for training. Notably, our distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps. We further apply Causal-rCM to Cosmos 3, an advanced omnimodal world foundation model for physical AI with action-conditioned generation capability, enabling an interactive world model.
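The frame-wise and chunk-wise settings mentioned above differ mainly in the attention mask a causal diffusion transformer uses. The following is a minimal sketch of one common convention, assumed here and not taken from the paper: tokens attend bidirectionally within their own chunk and causally to all earlier chunks, with frame-wise being the special case of one frame per chunk. The function name `block_causal_mask` and its parameters are hypothetical, and the sketch is unrelated to the custom-mask FlashAttention-2 JVP kernel itself.

```python
from typing import List


def block_causal_mask(
    num_frames: int,
    tokens_per_frame: int,
    frames_per_chunk: int = 1,
) -> List[List[bool]]:
    """Build a boolean attention mask; mask[q][k] is True if query token q may attend to key token k.

    Assumed convention (hypothetical, for illustration only):
    - tokens are laid out frame by frame, `tokens_per_frame` tokens each;
    - consecutive frames are grouped into chunks of `frames_per_chunk`;
    - a token sees every token in its own chunk and in all earlier chunks.
    frames_per_chunk=1 gives the frame-wise setting.
    """
    if num_frames <= 0 or tokens_per_frame <= 0 or frames_per_chunk <= 0:
        raise ValueError("all sizes must be positive")

    total_tokens = num_frames * tokens_per_frame
    # Chunk index of every token: token -> frame -> chunk.
    chunk_of = [
        (token // tokens_per_frame) // frames_per_chunk
        for token in range(total_tokens)
    ]
    # A key is visible when its chunk is not later than the query's chunk.
    return [
        [chunk_of[k] <= chunk_of[q] for k in range(total_tokens)]
        for q in range(total_tokens)
    ]


if __name__ == "__main__":
    # Chunk-wise example: 4 frames, 1 token per frame, 2 frames per chunk.
    for row in block_causal_mask(4, 1, frames_per_chunk=2):
        print("".join("x" if visible else "." for visible in row))
```

Under this convention, a larger `frames_per_chunk` trades streaming granularity for more bidirectional context inside each chunk. A mask of this shape could be passed to any attention implementation that accepts an arbitrary boolean mask.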