Learning Few-Step Diffusion Models by Trajectory Distribution Matching
March 9, 2025
Authors: Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, Jing Tang
cs.AI
Abstract
Accelerating diffusion model sampling is crucial for efficient AIGC
deployment. While diffusion distillation methods -- based on distribution
matching and trajectory matching -- reduce sampling to as few as one step, they
fall short on complex tasks like text-to-image generation. Few-step generation
offers a better balance between speed and quality, but existing approaches face
a persistent trade-off: distribution matching lacks flexibility for multi-step
sampling, while trajectory matching often yields suboptimal image quality. To
bridge this gap, we propose learning few-step diffusion models by Trajectory
Distribution Matching (TDM), a unified distillation paradigm that combines the
strengths of distribution and trajectory matching. Our method introduces a
data-free score distillation objective, aligning the student's trajectory with
the teacher's at the distribution level. Further, we develop a
sampling-steps-aware objective that decouples learning targets across different
steps, enabling more adjustable sampling. This approach supports both
deterministic sampling for superior image quality and flexible multi-step
adaptation, achieving state-of-the-art performance with remarkable efficiency.
Our model, TDM, outperforms existing methods on various backbones, such as SDXL
and PixArt-alpha, delivering superior quality and significantly reduced
training costs. In particular, our method distills PixArt-alpha into a
4-step generator that outperforms its teacher in real user-preference
evaluations at 1024 resolution. This is accomplished with only 500 training
iterations and 2 A800 GPU hours -- a mere
0.01% of the teacher's training cost. In addition, our proposed TDM can be
extended to accelerate text-to-video diffusion. Notably, TDM can outperform its
teacher model (CogVideoX-2B) by using only 4 NFE on VBench, improving the total
score from 80.91 to 81.65. Project page: https://tdm-t2x.github.io/
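The abstract names two ingredients -- a data-free score distillation objective that aligns the student's trajectory with the teacher's at the distribution level, and a sampling-steps-aware objective that decouples learning targets across steps -- without giving their form. The following is a minimal, hypothetical PyTorch sketch of what such a training step could look like: a few-step student unrolls its own trajectory from pure noise, and at each of its timesteps a DMD/VSD-style score difference pushes its samples toward the teacher's distribution. All names here (TinyScoreNet, tdm_style_step, the surrogate loss) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class TinyScoreNet(nn.Module):
    """Toy stand-in for a diffusion score/denoiser network (illustrative only)."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim)
        )

    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on the timestep by simple concatenation.
        t = t.expand(x.shape[0], 1)
        return self.net(torch.cat([x, t], dim=-1))


def tdm_style_step(student, teacher, fake_score, opt,
                   n_steps: int = 4, batch: int = 32, dim: int = 16) -> float:
    """One hypothetical, data-free distillation step.

    Only Gaussian noise is drawn (no real images). The few-step student
    unrolls its own trajectory; at each of its timesteps, a DMD/VSD-style
    score difference (fake score minus teacher score) supplies a
    distribution-matching gradient, and the state is detached between
    steps so every step gets its own, decoupled learning target.
    """
    timesteps = torch.linspace(1.0, 1.0 / n_steps, n_steps)  # t_1 > ... > t_K
    x = torch.randn(batch, dim)  # start from pure noise
    opt.zero_grad()
    total = torch.zeros(())
    for t in timesteps:
        x = student(x, t.view(1, 1))  # student's next point on its trajectory
        with torch.no_grad():
            # Following the usual distribution-matching convention, moving
            # against (fake - teacher) pushes samples toward the teacher.
            grad = fake_score(x, t.view(1, 1)) - teacher(x, t.view(1, 1))
        # Surrogate loss whose gradient w.r.t. x is exactly `grad`.
        total = total + (x * grad).sum() / batch
        x = x.detach()  # decouple the next step's target from this one
    total.backward()
    opt.step()
    return float(total)


if __name__ == "__main__":
    student, teacher, fake = TinyScoreNet(), TinyScoreNet(), TinyScoreNet()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    for _ in range(3):
        print(tdm_style_step(student, teacher, fake, opt))
```

In a complete method of this kind, the fake score network would itself be trained alternately on the student's samples; it is frozen here only to keep the sketch short.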