Causal-rCM: A Unified Teacher-Forcing and Self-Forcing Open Recipe for Autoregressive Diffusion Distillation in Streaming Video Generation and Interactive World Models

Abstract

Autoregressive video diffusion with causal diffusion transformers has emerged as a major paradigm for real-time streaming video generation and action-conditioned interactive world models. In this work, we extend rCM, an advanced diffusion distillation framework, to autoregressive video diffusion. The core philosophy of rCM is that forward and reverse divergences, represented in diffusion distillation by consistency models (CMs) and distribution matching distillation (DMD), respectively, complement each other. This philosophy carries over naturally to the autoregressive setting, where teacher-forcing (TF) provides an offline, forward-divergence causal training paradigm, while self-forcing (SF) corresponds to an on-policy, reverse-divergence refinement. Our contributions are: (1) through extensive experiments, we show that teacher-forcing CM is currently the best complementary initialization strategy for self-forcing DMD; (2) we present the first implementation of teacher-forcing-based continuous-time CMs (e.g., sCM/MeanFlow) for autoregressive video diffusion, enabled by our custom-mask FlashAttention-2 JVP kernel, achieving 10× faster convergence than discrete-time CMs (dCMs); (3) we introduce Causal-rCM, a leading, unified, and scalable algorithm-infrastructure open recipe for diffusion distillation and causal training; (4) we achieve state-of-the-art streaming video generation performance in both frame-wise and chunk-wise settings, using only synthetic data for training. Notably, our distilled 2-step causal Wan2.1-1.3B model achieves a VBench-T2V score of 84.63 with only 1 or 2 sampling steps. We further apply Causal-rCM to Cosmos 3, an advanced omnimodal world foundation model for physical AI with action-conditioned generation capability, enabling an interactive world model.
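
To make the frame-wise and chunk-wise settings concrete, the following is a minimal sketch of a block-causal attention mask, assuming tokens are laid out in temporal order with a fixed number of tokens per frame. The function `block_causal_mask` and its parameters are hypothetical names chosen for illustration; this is not the paper's custom-mask FlashAttention-2 JVP kernel, and it does not model any additional masking that a teacher-forcing training setup may require.

```python
def block_causal_mask(num_frames, tokens_per_frame, frames_per_chunk=1):
    """Return a boolean matrix where mask[q][k] is True if query token q
    may attend to key token k (illustrative helper, not the paper's kernel)."""
    # Chunk index of every token, with tokens laid out in temporal order.
    chunk_of_token = [
        frame // frames_per_chunk
        for frame in range(num_frames)
        for _ in range(tokens_per_frame)
    ]
    # A token attends to every token in its own chunk and in earlier chunks.
    return [[cq >= ck for ck in chunk_of_token] for cq in chunk_of_token]


# Frame-wise setting: each frame is its own chunk.
frame_wise = block_causal_mask(num_frames=4, tokens_per_frame=2, frames_per_chunk=1)
# Chunk-wise setting: two frames share one chunk and attend to each other.
chunk_wise = block_causal_mask(num_frames=4, tokens_per_frame=2, frames_per_chunk=2)
```

With `frames_per_chunk=1` the mask is causal at frame granularity, while a larger value lets the frames inside a chunk attend to each other bidirectionally and keeps causality only across chunks.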