

Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

February 2, 2026
作者: Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu
cs.AI

Abstract

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which introduces an architectural gap when full attention is replaced by causal attention. Existing approaches, however, do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity: each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, preventing recovery of the teacher's flow map and instead inducing a conditional-expectation solution, which degrades performance. To address this issue, we propose Causal Forcing, which uses an AR teacher for ODE initialization, thereby bridging the architectural gap. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page and code: https://thu-ml.github.io/CausalForcing.github.io/
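
To make the injectivity requirement concrete, here is a minimal sketch of the argument in our own notation (not the paper's). ODE distillation regresses the student onto the clean frame that the teacher's probability-flow ODE assigns to each noisy frame. Let $x_t^i$ be frame $i$ at noise level $t$, and let $\Phi^{\mathrm{AR}}$ denote the flow map of an AR teacher conditioned only on past frames $x^{<i}$. The distillation objective

\[
\min_\theta \; \mathbb{E}\Big[\big\| G_\theta(x_t^i, x^{<i}) - \Phi^{\mathrm{AR}}(x_t^i \mid x^{<i}) \big\|^2\Big]
\]

has a unique target for each input, because the AR flow map is injective at the frame level. If the targets instead come from a bidirectional teacher $\Phi^{\mathrm{bi}}(x_t^i \mid x^{<i}, x^{>i})$, the same input $(x_t^i, x^{<i})$ is paired with many different targets depending on the unseen future frames $x^{>i}$, so the least-squares optimum collapses to the conditional expectation

\[
G_\theta^\ast(x_t^i, x^{<i}) = \mathbb{E}\big[\Phi^{\mathrm{bi}}(x_t^i \mid x^{<i}, x^{>i}) \,\big|\, x_t^i, x^{<i}\big],
\]

i.e. an average over possible futures rather than any single flow-map solution. This is the sense in which, per the abstract, a bidirectional teacher induces a conditional-expectation solution, while an AR teacher preserves the one-to-one mapping that ODE distillation needs.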