Causal Forcing++: Scalable Few-Step Autoregressive Diffusion Distillation for Real-Time Interactive Video Generation

May 14, 2026
Authors: Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, Jun Zhu
cs.AI

Abstract

Real-time interactive video generation requires low-latency, streaming, and controllable rollout. Existing autoregressive (AR) diffusion distillation methods have achieved strong results in the chunk-wise 4-step regime by distilling bidirectional base models into few-step AR students, but they remain limited by coarse response granularity and non-negligible sampling latency. In this paper, we study a more aggressive setting: frame-wise autoregression with only 1 to 2 sampling steps. In this regime, we identify the initialization of the few-step AR student as the key bottleneck: existing strategies are either target-misaligned, incapable of few-step generation, or too costly to scale. We propose Causal Forcing++, a principled and scalable pipeline that uses causal consistency distillation (causal CD) for few-step AR initialization. The core idea is that causal CD learns the same AR-conditional flow map as causal ODE distillation, but obtains its supervision from a single online teacher ODE step between adjacent timesteps, avoiding the need to precompute and store full PF-ODE trajectories. This makes the initialization both more efficient and easier to optimize. Under the frame-wise 2-step setting, Causal Forcing++ surpasses the SOTA 4-step chunk-wise Causal Forcing by 0.1 in VBench Total, 0.3 in VBench Quality, and 0.335 in VisionReward, while reducing first-frame latency by 50% and Stage 2 training cost by approximately 4×. We further extend the pipeline to action-conditioned world model generation in the spirit of Genie3. Project pages: https://github.com/thu-ml/Causal-Forcing and https://github.com/shengshu-ai/minWM.
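
To make the "single online teacher ODE step" idea concrete, the following is a minimal sketch of a generic consistency-distillation loss conditioned on past frames. It is not the authors' implementation: the toy teacher, the interpolation x_t = (1 - t) * x0 + t * noise, and every function name here (`context_mean`, `teacher_velocity`, `teacher_ode_step`, `causal_cd_loss`, `perfect_student`, `identity_student`) are assumptions chosen for illustration.

```python
from typing import Callable, List, Sequence

Frame = List[float]
Student = Callable[[Frame, float, Sequence[Frame]], Frame]


def context_mean(context: Sequence[Frame]) -> Frame:
    """Element-wise mean of the past frames (toy stand-in for a real teacher)."""
    n = len(context)
    return [sum(column) / n for column in zip(*context)]


def teacher_velocity(x_t: Frame, t: float, context: Sequence[Frame]) -> Frame:
    """Toy AR teacher (assumption): the clean next frame x0 is the context mean.

    With x_t = (1 - t) * x0 + t * noise and a single possible x0,
    the ODE velocity is (x_t - x0) / t.
    """
    x0 = context_mean(context)
    return [(a - b) / t for a, b in zip(x_t, x0)]


def teacher_ode_step(x_t: Frame, t: float, s: float,
                     context: Sequence[Frame]) -> Frame:
    """One Euler step of the teacher ODE from time t to an adjacent time s < t."""
    v = teacher_velocity(x_t, t, context)
    return [a + (s - t) * b for a, b in zip(x_t, v)]


def causal_cd_loss(student: Student, target_student: Student, x_t: Frame,
                   t: float, s: float, context: Sequence[Frame]) -> float:
    """Consistency loss between two adjacent timesteps on the same ODE path.

    The target comes from one online teacher step, so no full trajectory is
    precomputed or stored. In real training the target network's output is
    treated as a constant (stop-gradient); here it is a plain function call.
    """
    x_s = teacher_ode_step(x_t, t, s, context)
    target = target_student(x_s, s, context)
    pred = student(x_t, t, context)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def perfect_student(x: Frame, t: float, context: Sequence[Frame]) -> Frame:
    """Already maps any noisy input to the ODE endpoint of the toy teacher."""
    return context_mean(context)


def identity_student(x: Frame, t: float, context: Sequence[Frame]) -> Frame:
    """Untrained baseline that returns its noisy input unchanged."""
    return list(x)
```

Calling `causal_cd_loss(perfect_student, perfect_student, x_t, 0.8, 0.6, context)` yields zero for this toy teacher, because a student that already maps every point to the ODE endpoint is consistent across adjacent timesteps. `identity_student` yields a positive loss.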