DCM: Dual-Expert Consistency Model for Efficient and High-Quality Video Generation

Abstract

Diffusion models have achieved remarkable results in video synthesis, but they require iterative denoising steps, leading to substantial computational overhead. Consistency models have made significant progress in accelerating diffusion models. However, directly applying them to video diffusion models often results in severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of consistency models, we identify a key conflict in learning dynamics during the distillation process: the optimization gradients and loss contributions differ significantly across timesteps. This discrepancy prevents the distilled student model from reaching an optimal state, which compromises temporal consistency and degrades appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), in which a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine detail refinement. Furthermore, we introduce a Temporal Coherence Loss to improve motion consistency for the semantic expert, and we apply GAN and Feature Matching losses to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation. Our code and models are available at https://github.com/Vchitect/DCM.
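
The division of labor between the two experts can be illustrated with a minimal sketch. The abstract does not state how a sampling step is assigned to an expert, so the rule used here is an assumption: timesteps at or above a hypothetical `boundary` (high noise) go to the semantic expert, and the remaining low-noise timesteps go to the detail expert. The names `select_expert`, `few_step_sample`, `boundary`, and the toy expert functions are all illustrative and are not taken from the DCM code base. The re-noising between steps that a real consistency sampler would perform is omitted.

```python
from typing import Callable, List, Tuple

# Hypothetical expert signature: (current sample, timestep) -> refined sample.
Expert = Callable[[float, float], float]


def select_expert(t: float, boundary: float) -> str:
    """Pick an expert for timestep t, with t in [0, 1] and 1 meaning pure noise.

    Assumption: high-noise steps shape layout and motion (semantic expert),
    low-noise steps refine appearance (detail expert).
    """
    return "semantic" if t >= boundary else "detail"


def few_step_sample(
    x: float,
    timesteps: List[float],
    semantic_expert: Expert,
    detail_expert: Expert,
    boundary: float = 0.5,
) -> Tuple[float, List[str]]:
    """Run a few-step sampling loop, routing each step to one expert.

    Returns the final sample and the sequence of experts that were used.
    """
    trace: List[str] = []
    for t in timesteps:
        name = select_expert(t, boundary)
        expert = semantic_expert if name == "semantic" else detail_expert
        x = expert(x, t)
        trace.append(name)
    return x, trace


# Toy stand-ins for the two experts; real experts would be neural networks.
def toy_semantic(x: float, t: float) -> float:
    return x * 0.5


def toy_detail(x: float, t: float) -> float:
    return x + 0.25


result, trace = few_step_sample(
    8.0, [1.0, 0.75, 0.5, 0.25], toy_semantic, toy_detail
)
print(trace)   # which expert handled each of the four steps
print(result)  # final toy sample
```

With four steps and `boundary = 0.5`, the first three steps are handled by the semantic expert and the last by the detail expert. In practice the boundary and the number of steps per expert would be tuned, and the two experts could share most of their parameters, in line with the parameter-efficient design the abstract describes.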