Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

Abstract

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods distill bidirectional base models into few-step AR students and achieve strong results in the chunk-wise 4-step regime, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1-2 sampling steps. In this regime, we identify the initialization of the few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains its supervision from a single online teacher ODE step between adjacent timesteps, which avoids precomputing and storing full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. In the frame-wise 2-step setting, Causal Forcing++ surpasses the SOTA 4-step chunk-wise Causal Forcing by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by approximately 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project page: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM
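
To make the causal CD idea concrete, the following is a minimal NumPy sketch of a consistency loss whose target comes from a single online teacher ODE step between adjacent timesteps. It is an illustration under stated assumptions, not the authors' implementation: the interpolation x_t = (1 - t) * x0 + t * noise, the toy teacher whose target frame equals the conditioning frame, and every function name (`teacher_velocity`, `teacher_ode_step`, `causal_cd_loss`) are hypothetical choices made for this example.

```python
import numpy as np

def teacher_velocity(x_t, t, context):
    """Toy stand-in for the teacher's velocity field (an assumption, not the paper's model).

    Assumes x_t = (1 - t) * x0 + t * noise and a target frame x0 equal to `context`,
    so the exact velocity along the trajectory is (x_t - x0) / t.
    """
    return (x_t - context) / t

def teacher_ode_step(x_t, t, t_next, context):
    """One Euler step of the teacher ODE from t to the adjacent timestep t_next < t."""
    return x_t + (t_next - t) * teacher_velocity(x_t, t, context)

def causal_cd_loss(student, target_student, x_t, t, t_next, context):
    """Squared error between student outputs at two adjacent points of one teacher trajectory."""
    # The adjacent point is computed online; no full trajectory is precomputed or stored.
    x_next = teacher_ode_step(x_t, t, t_next, context)
    pred = student(x_t, t, context)
    # In real training the target network would be a stop-gradient or EMA copy.
    target = target_student(x_next, t_next, context)
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
context = rng.normal(size=(4, 8))      # stands in for the clean past frame(s)
noise = rng.normal(size=(4, 8))
t, t_next = 0.8, 0.6
x_t = (1 - t) * context + t * noise    # noisy version of the current frame

exact_flow_map = lambda x, t, ctx: ctx   # true flow map of this toy ODE
identity_map = lambda x, t, ctx: x       # a deliberately wrong student

loss_exact = causal_cd_loss(exact_flow_map, exact_flow_map, x_t, t, t_next, context)
loss_wrong = causal_cd_loss(identity_map, identity_map, x_t, t, t_next, context)
print(loss_exact, loss_wrong)
```

In this toy setting the exact flow map gives zero loss, while the identity student does not. The only teacher work per sample is one ODE step, which is the property the abstract contrasts with storing full PF-ODE trajectories.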