Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
October 9, 2025
Authors: Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
cs.AI
Abstract
This work represents the first effort to scale up continuous-time consistency
distillation to general application-level image and video diffusion models.
Although the continuous-time consistency model (sCM) is theoretically
principled and empirically powerful for accelerating academic-scale diffusion, its
applicability to large-scale text-to-image and video tasks remains unclear due
to infrastructure challenges in Jacobian-vector product (JVP) computation and
the limitations of standard evaluation benchmarks. We first develop a
parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on
models with over 10 billion parameters and high-dimensional video tasks. Our
investigation reveals fundamental quality limitations of sCM in fine-detail
generation, which we attribute to error accumulation and the "mode-covering"
nature of its forward-divergence objective. To remedy this, we propose the
score-regularized continuous-time consistency model (rCM), which incorporates
score distillation as a long-skip regularizer. This integration complements sCM
with the "mode-seeking" reverse divergence, effectively improving visual
quality while maintaining high generation diversity. Validated on large-scale
models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM
matches or surpasses the state-of-the-art distillation method DMD2 on quality
metrics while offering notable advantages in diversity, all without GAN tuning
or extensive hyperparameter searches. The distilled models generate
high-fidelity samples in only 1–4 steps, accelerating diffusion sampling
by 15×–50×. These results position rCM as a practical and
theoretically grounded framework for advancing large-scale diffusion
distillation.
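
The JVP referenced above is the quantity sCM training must evaluate: the time derivative of the consistency network along the probability-flow ODE. Below is a minimal sketch, using torch.func.jvp, of how that tangent can be propagated in a single forward pass; the paper's contribution is a parallelism-compatible FlashAttention-2 JVP kernel inside the attention layers, whereas the toy network, the drift placeholder, and the function name here are illustrative assumptions, not the authors' code.

```python
import torch
from torch.func import jvp

def consistency_time_derivative(f_theta, x_t, t, dx_dt):
    """Return f_theta(x_t, t) and its total time derivative along the ODE.

    dx_dt is the probability-flow ODE drift at (x_t, t); pairing it with a
    unit tangent in t yields d/dt f_theta(x_t(t), t) in a single jvp call.
    """
    primals = (x_t, t)
    tangents = (dx_dt, torch.ones_like(t))
    out, dout_dt = jvp(f_theta, primals, tangents)
    return out, dout_dt

if __name__ == "__main__":
    # Stand-in for a diffusion network; any (x, t) -> x-shaped callable works.
    net = lambda x, t: torch.tanh(x) * t[:, None]
    x = torch.randn(2, 4)
    t = torch.rand(2)
    drift = torch.randn(2, 4)          # placeholder PF-ODE drift
    y, dy_dt = consistency_time_derivative(net, x, t, drift)
    print(y.shape, dy_dt.shape)        # torch.Size([2, 4]) torch.Size([2, 4])
```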
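One way to read the rCM construction described in the abstract is as the sCM consistency objective augmented by a score-distillation regularizer; the loss symbols and the weight lambda below are illustrative notation, not the paper's exact formulation.

```latex
\mathcal{L}_{\mathrm{rCM}}(\theta)
  \;=\;
  \underbrace{\mathcal{L}_{\mathrm{sCM}}(\theta)}_{\substack{\text{continuous-time consistency}\\ \text{forward divergence, mode-covering}}}
  \;+\;
  \lambda\,
  \underbrace{\mathcal{L}_{\mathrm{SD}}(\theta)}_{\substack{\text{score-distillation regularizer}\\ \text{reverse divergence, mode-seeking}}}
```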
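The "1–4 steps" figure refers to consistency-style few-step sampling, in which the distilled network predicts a clean sample directly and any intermediate step only re-noises to the next noise level. The loop below is a generic sketch of that procedure under a variance-exploding noise convention; the model, the timesteps, and the re-noising rule are assumptions for illustration, not the released inference pipelines.

```python
import torch

@torch.no_grad()
def few_step_sample(model, shape, timesteps, device="cpu"):
    """timesteps: decreasing noise levels, e.g. [t_max, t_mid, t_min]."""
    x = torch.randn(shape, device=device) * timesteps[0]
    for i, t in enumerate(timesteps):
        t_vec = torch.full((shape[0],), t, device=device)
        x0_hat = model(x, t_vec)                  # direct clean prediction
        if i + 1 < len(timesteps):                # re-noise to the next level
            x = x0_hat + timesteps[i + 1] * torch.randn_like(x0_hat)
        else:
            x = x0_hat
    return x

if __name__ == "__main__":
    dummy = lambda x, t: x / (1.0 + t[:, None, None, None])  # placeholder net
    out = few_step_sample(dummy, (1, 3, 8, 8), timesteps=[80.0, 1.0])
    print(out.shape)                              # torch.Size([1, 3, 8, 8])
```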