ChatPaper.ai


Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

February 2, 2026
作者: Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu
cs.AI

Abstract

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which introduces an architectural gap when full attention is replaced by causal attention. Existing approaches, however, do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity: each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this, we propose Causal Forcing, which uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the state-of-the-art Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and code: https://thu-ml.github.io/CausalForcing.github.io/
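The conditional-expectation failure mode described above can be illustrated with a toy regression, separate from the paper's actual video setting: when an MSE objective is fit against a one-to-many target (the non-injective case), the optimum is the average of the targets rather than any single target, whereas a one-to-one (injective) target is recovered exactly. The scalar model and data here are purely hypothetical.

```python
import numpy as np

# Toy sketch (not the paper's construction): MSE regression against a
# one-to-many target collapses to the conditional expectation.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)

# Injective case: each input x pairs with a unique target y = 2x.
y_unique = 2.0 * x
w = (x @ y_unique) / (x @ x)   # closed-form minimizer of ||w*x - y||^2
# w recovers the true map (w ~= 2.0).

# Non-injective case: the same x is paired with two possible targets,
# loosely analogous to one noisy frame being consistent with many clean
# frames under a bidirectional teacher. The MSE optimum is their mean.
x2 = np.concatenate([x, x])
y_multi = np.concatenate([2.0 * x, -2.0 * x])
w2 = (x2 @ y_multi) / (x2 @ x2)
# w2 is the conditional mean of the two maps (w2 ~= 0.0), matching
# neither branch: the flow map cannot be recovered.
```

The same averaging effect, in the frame-level setting the abstract describes, is what motivates using an AR teacher whose PF-ODE makes the noisy-to-clean mapping unique before distillation.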