Phased DMD: Few-step Distribution Matching Distillation via Score Matching within Subintervals
October 31, 2025
Authors: Xiangyu Fan, Zesong Qiu, Zhuguanyu Wu, Fanzhou Wang, Zhiqian Lin, Tianxiang Ren, Dahua Lin, Ruihao Gong, Lei Yang
cs.AI
Abstract
Distribution Matching Distillation (DMD) distills score-based generative
models into efficient one-step generators, without requiring a one-to-one
correspondence with the sampling trajectories of their teachers. However,
limited model capacity causes one-step distilled models to underperform on complex
generative tasks, e.g., synthesizing intricate object motions in text-to-video
generation. Directly extending DMD to multi-step distillation increases memory
usage and computational depth, leading to instability and reduced efficiency.
While prior works propose stochastic gradient truncation as a potential
solution, we observe that it substantially reduces the generation diversity of
multi-step distilled models, bringing it down to the level of their one-step
counterparts. To address these limitations, we propose Phased DMD, a multi-step
distillation framework that combines phase-wise distillation with a
Mixture-of-Experts (MoE) design, reducing learning difficulty while enhancing model
capacity. Phased DMD is built upon two key ideas: progressive distribution
matching and score matching within subintervals. First, we divide the SNR
range into subintervals and progressively refine the model toward higher SNR
levels to better capture complex distributions. Second, to ensure that the
training objective within each subinterval is accurate, we conduct rigorous
mathematical derivations. We validate Phased DMD by distilling state-of-the-art
image and video generation models, including Qwen-Image (20B parameters) and
Wan2.2 (28B parameters). Experimental results demonstrate that Phased DMD
preserves output diversity better than DMD while retaining key generative
capabilities. We will release our code and models.
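The phased schedule described above can be sketched minimally as follows. This is an illustrative outline only, not the paper's implementation: the function names, the uniform log-SNR partition, and the per-phase sampling are all assumptions; the actual partition scheme and subinterval losses are defined in the paper itself.

```python
import numpy as np

def partition_snr_range(log_snr_min, log_snr_max, num_phases):
    """Split the log-SNR range into contiguous subintervals, one per phase.

    Hypothetical helper: a uniform split in log-SNR is used here purely for
    illustration; the paper's actual partition may differ.
    """
    edges = np.linspace(log_snr_min, log_snr_max, num_phases + 1)
    # Phases proceed from low SNR (coarse structure) to high SNR (fine detail).
    return [(edges[i], edges[i + 1]) for i in range(num_phases)]

def sample_level_in_phase(interval, rng):
    """Draw a training log-SNR level uniformly inside the current subinterval,
    so the score-matching objective stays restricted to that phase."""
    lo, hi = interval
    return rng.uniform(lo, hi)

# Example: three phases over log-SNR in [-6, 6].
phases = partition_snr_range(-6.0, 6.0, num_phases=3)
rng = np.random.default_rng(0)
for phase_idx, interval in enumerate(phases):
    level = sample_level_in_phase(interval, rng)
    # In actual training, the generator for this phase would be refined here
    # with the subinterval-restricted distribution-matching loss.
    assert interval[0] <= level <= interval[1]
```

The point of the restriction is that each phase's training signal never leaves its subinterval, which is what keeps the per-phase objective well-posed.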