Learning Few-Step Diffusion Models by Trajectory Distribution Matching
March 9, 2025
Authors: Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, Jing Tang
cs.AI
Abstract
Accelerating diffusion model sampling is crucial for efficient AIGC
deployment. While diffusion distillation methods -- based on distribution
matching and trajectory matching -- reduce sampling to as few as one step, they
fall short on complex tasks like text-to-image generation. Few-step generation
offers a better balance between speed and quality, but existing approaches face
a persistent trade-off: distribution matching lacks flexibility for multi-step
sampling, while trajectory matching often yields suboptimal image quality. To
bridge this gap, we propose learning few-step diffusion models by Trajectory
Distribution Matching (TDM), a unified distillation paradigm that combines the
strengths of distribution and trajectory matching. Our method introduces a
data-free score distillation objective, aligning the student's trajectory with
the teacher's at the distribution level. Further, we develop a
sampling-steps-aware objective that decouples learning targets across different
steps, enabling more adjustable sampling. This approach supports both
deterministic sampling for superior image quality and flexible multi-step
adaptation, achieving state-of-the-art performance with remarkable efficiency.
Our model, TDM, outperforms existing methods on various backbones, such as SDXL
and PixArt-alpha, delivering superior quality and significantly reduced
training costs. In particular, our method distills PixArt-alpha into a
4-step generator that outperforms its teacher on real user preference at 1024
resolution. This is accomplished with 500 iterations and 2 A800 hours -- a mere
0.01% of the teacher's training cost. In addition, our proposed TDM can be
extended to accelerate text-to-video diffusion. Notably, TDM can outperform its
teacher model (CogVideoX-2B) by using only 4 NFE on VBench, improving the total
score from 80.91 to 81.65. Project page: https://tdm-t2x.github.io/
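To make the abstract's first ingredient concrete, below is a minimal, self-contained PyTorch sketch of a DMD-style, data-free score-distillation update applied to the student's own few-step generations, in the spirit of the "distribution level" alignment described above. Everything here (the `TinyDenoiser` stand-in, the Euler-like sampler, the linear noise schedule, the surrogate loss) is an illustrative assumption, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in noise-prediction network eps(x_t, t); a real setup would use a U-Net/DiT."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t[:, None]], dim=-1))

dim, batch = 16, 8
teacher = TinyDenoiser(dim).requires_grad_(False)  # frozen pretrained teacher
fake = TinyDenoiser(dim)      # online score model of the student's own distribution
student = TinyDenoiser(dim)   # few-step generator being distilled
opt_student = torch.optim.Adam(student.parameters(), lr=1e-4)
opt_fake = torch.optim.Adam(fake.parameters(), lr=1e-4)

def few_step_sample(gen: nn.Module, z: torch.Tensor, n_steps: int = 4) -> torch.Tensor:
    """Deterministic few-step sampler (illustrative Euler-like update)."""
    ts = torch.linspace(1.0, 0.0, n_steps + 1)
    x = z
    for i in range(n_steps):
        t = ts[i].expand(x.shape[0])
        x = x - (ts[i] - ts[i + 1]) * gen(x, t)
    return x

for it in range(100):  # toy training loop
    # 1) Data-free: the student's "data" are its own generations from pure noise.
    x0 = few_step_sample(student, torch.randn(batch, dim))

    # 2) Re-noise the generation at a random time and query both score models there.
    t = torch.rand(batch)
    x_t = x0 + t[:, None] * torch.randn_like(x0)
    with torch.no_grad():
        grad = fake(x_t, t) - teacher(x_t, t)  # direction of the KL gradient (DMD-style)

    # 3) Surrogate loss whose gradient w.r.t. x_t equals `grad`, pushing the
    #    student's marginals toward the teacher's at the distribution level.
    loss_student = 0.5 * F.mse_loss(x_t, (x_t - grad).detach())
    opt_student.zero_grad(); loss_student.backward(); opt_student.step()

    # 4) Keep `fake` a valid score model of the *current* student distribution
    #    via standard denoising training on detached student samples.
    t2 = torch.rand(batch)
    eps = torch.randn_like(x0)
    loss_fake = F.mse_loss(fake(x0.detach() + t2[:, None] * eps, t2), eps)
    opt_fake.zero_grad(); loss_fake.backward(); opt_fake.step()
```

The property worth noting is that no real data enters the loop: the teacher and the online "fake" score are queried only at re-noised student samples, which is what makes the objective data-free.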
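The abstract's second ingredient, the sampling-steps-aware objective, "decouples learning targets across different steps." One plausible reading, sketched below under that assumption, is to restrict the re-noising times used by the matching loss to the time segment owned by each sampling step, so that, say, a 4-step student is not forced onto the same target as a 1-step one. The uniform time grid and per-step segment assignment are my assumptions for illustration.

```python
import torch

def step_segment_times(n_steps: int, k: int, batch: int) -> torch.Tensor:
    """Draw re-noising times restricted to the segment owned by sampling step k
    of an n-step sampler, giving each step its own (decoupled) matching target.
    The uniform grid and segment assignment are illustrative assumptions."""
    ts = torch.linspace(1.0, 0.0, n_steps + 1)  # t_0 = 1 (noise) ... t_n = 0 (data)
    lo, hi = ts[k + 1], ts[k]                   # step k covers [t_{k+1}, t_k]
    return lo + (hi - lo) * torch.rand(batch)

# During distillation one would sample n_steps (e.g. 1, 2, or 4), roll the student
# out for k steps to reach the segment boundary, and apply the score-distillation
# loss from the previous sketch only at times drawn from step k's segment:
n_steps = 4
for k in range(n_steps):
    t = step_segment_times(n_steps, k, batch=8)
    print(f"step {k}: times in [{t.min().item():.2f}, {t.max().item():.2f}]")
```

Decoupling targets per step is one plausible mechanism for the flexible multi-step adaptation the abstract claims; the paper's actual step assignment and schedule may differ.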