Improved Distribution Matching Distillation for Fast Image Synthesis
May 23, 2024
Authors: Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman
cs.AI
Abstract
Recent approaches have shown promise in distilling diffusion models into
efficient one-step generators. Among them, Distribution Matching Distillation
(DMD) produces one-step generators that match their teacher in distribution,
without enforcing a one-to-one correspondence with the sampling trajectories of
their teachers. However, to ensure stable training, DMD requires an additional
regression loss computed using a large set of noise-image pairs generated by
the teacher with many steps of a deterministic sampler. This is costly for
large-scale text-to-image synthesis and limits the student's quality, tying it
too closely to the teacher's original sampling paths. We introduce DMD2, a set
of techniques that lift this limitation and improve DMD training. First, we
eliminate the regression loss and the need for expensive dataset construction.
We show that the resulting instability is due to the fake critic not estimating
the distribution of generated samples accurately and propose a two time-scale
update rule as a remedy. Second, we integrate a GAN loss into the distillation
procedure, discriminating between generated samples and real images. This lets
us train the student model on real data, mitigating the imperfect real score
estimation from the teacher model, and enhancing quality. Lastly, we modify the
training procedure to enable multi-step sampling. We identify and address the
training-inference input mismatch problem in this setting by simulating
inference-time generator samples during training. Taken together, our
improvements set new benchmarks in one-step image generation, with FID scores
of 1.28 on ImageNet-64x64 and 8.35 on zero-shot COCO 2014, surpassing the
original teacher despite a 500X reduction in inference cost. Further, we show
our approach can generate megapixel images by distilling SDXL, demonstrating
exceptional visual quality among few-step methods.
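The abstract's first change, replacing the regression loss with a two time-scale update rule, can be pictured as a training loop in which the fake critic is refreshed several times per generator update so it keeps tracking the generator's moving output distribution. The sketch below is a minimal stand-in, not the authors' code: the toy MLPs, the single noise level, the x0-prediction form, and the 5:1 update ratio are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Toy stand-in for a diffusion UNet (x0-prediction form)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.SiLU(),
                                 nn.Linear(256, dim))
    def forward(self, x):
        return self.net(x)

dim = 64
generator = TinyNet(dim)                         # one-step student G(z)
fake_critic = TinyNet(dim)                       # scores generated samples
real_score = TinyNet(dim).requires_grad_(False)  # frozen teacher (real score)

g_opt = torch.optim.AdamW(generator.parameters(), lr=1e-4)
c_opt = torch.optim.AdamW(fake_critic.parameters(), lr=1e-4)

CRITIC_STEPS = 5  # two time scales: critic updated more often (assumed ratio)

def noised(x0, sigma=0.5):
    # crude single-level stand-in for forward diffusion
    return x0 + sigma * torch.randn_like(x0)

for step in range(1000):
    z = torch.randn(32, dim)

    # Phase 1: several denoising updates keep the fake critic an accurate
    # estimate of the *current* generator distribution -- the inaccuracy the
    # abstract identifies as the source of instability.
    for _ in range(CRITIC_STEPS):
        with torch.no_grad():
            x_fake = generator(z)
        c_opt.zero_grad()
        F.mse_loss(fake_critic(noised(x_fake)), x_fake).backward()
        c_opt.step()

    # Phase 2: one generator update with the distribution-matching gradient,
    # the (fake - real) score difference, applied through a surrogate loss.
    g_opt.zero_grad()
    x_fake = generator(z)
    xt = noised(x_fake)
    with torch.no_grad():
        grad = fake_critic(xt) - real_score(xt)
    target = (x_fake - grad).detach()
    # d(loss)/d(x_fake) is proportional to grad, so descending this loss
    # pushes generated samples toward the teacher's distribution.
    loss_g = 0.5 * F.mse_loss(x_fake, target)
    loss_g.backward()
    g_opt.step()
```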
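The second change adds a GAN term that discriminates generated samples from real images, letting the student learn from real data instead of relying solely on the teacher's imperfect real-score estimate. The abstract does not specify the classifier design or loss weighting, so the standalone MLP discriminator, the non-saturating logistic loss, and GAN_WEIGHT below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in discriminator; the paper's actual classifier architecture is
# not specified in the abstract, so a separate MLP is assumed here.
disc = nn.Sequential(nn.Linear(64, 256), nn.SiLU(), nn.Linear(256, 1))
d_opt = torch.optim.AdamW(disc.parameters(), lr=1e-4)
GAN_WEIGHT = 1e-2  # assumed relative weight of the GAN term

def d_loss(x_real, x_fake):
    # Non-saturating logistic loss: raise logits on real images, lower them
    # on generated ones; x_fake is detached so only `disc` is updated.
    return (F.softplus(-disc(x_real)).mean()
            + F.softplus(disc(x_fake.detach())).mean())

def g_gan_loss(x_fake):
    # The student is rewarded when the discriminator scores its samples as
    # real; through `disc`, real data now shapes the student's gradients.
    return F.softplus(-disc(x_fake)).mean()

# Inside the training loop of the previous sketch one would add, e.g.:
#   d_opt.zero_grad(); d_loss(x_real, x_fake).backward(); d_opt.step()
#   loss_g = dmd_surrogate + GAN_WEIGHT * g_gan_loss(x_fake)
```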
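The last change, multi-step sampling without a training-inference input mismatch, amounts to producing the student's step-k training input by actually running the first k sampling steps on the student's own outputs, rather than by noising ground-truth images. The 4-step schedule, the renoising rule, and the time conditioning in this sketch are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

TIMESTEPS = [999, 749, 499, 249]  # assumed 4-step schedule (T = 1000)

class StepGenerator(nn.Module):
    """Toy time-conditioned student: predicts a clean image from (x_t, t)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, dim))
    def forward(self, x, t):
        t_feat = torch.full((x.shape[0], 1), t / 1000.0)
        return self.net(torch.cat([x, t_feat], dim=1))

def renoise(x0, t, T=1000):
    # illustrative variance-preserving forward diffusion back to level t
    a = 1.0 - t / T
    return a ** 0.5 * x0 + (1.0 - a) ** 0.5 * torch.randn_like(x0)

@torch.no_grad()
def simulate_input(gen, z, k):
    """Input the student would actually see at step k during inference:
    run the first k steps (denoise, then renoise) on its own outputs
    instead of noising a ground-truth image."""
    x = z  # pure noise at the first step
    for i in range(k):
        x0 = gen(x, TIMESTEPS[i])
        x = renoise(x0, TIMESTEPS[i + 1])
    return x

gen = StepGenerator()
z = torch.randn(8, 64)
k = int(torch.randint(len(TIMESTEPS), (1,)))  # random step to train on
x_in = simulate_input(gen, z, k)
x0_pred = gen(x_in, TIMESTEPS[k])  # DMD + GAN losses are then applied here
```

At inference the same loop simply runs through all of TIMESTEPS, so the inputs the student sees at training and test time come from the same distribution, which is the point of the fix.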