Improved Distribution Matching Distillation for Fast Image Synthesis
May 23, 2024
Authors: Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman
cs.AI
Abstract
Recent approaches have shown promise in distilling diffusion models into
efficient one-step generators. Among them, Distribution Matching Distillation
(DMD) produces one-step generators that match their teacher in distribution,
without enforcing a one-to-one correspondence with the sampling trajectories of
their teachers. However, to ensure stable training, DMD requires an additional
regression loss computed using a large set of noise-image pairs generated by
the teacher with many steps of a deterministic sampler. This is costly for
large-scale text-to-image synthesis and limits the student's quality, tying it
too closely to the teacher's original sampling paths. We introduce DMD2, a set
of techniques that lift this limitation and improve DMD training. First, we
eliminate the regression loss and the need for expensive dataset construction.
We show that the resulting instability is due to the fake critic not estimating
the distribution of generated samples accurately and propose a two time-scale
update rule as a remedy. Second, we integrate a GAN loss into the distillation
procedure, discriminating between generated samples and real images. This lets
us train the student model on real data, mitigating the imperfect real score
estimation from the teacher model, and enhancing quality. Lastly, we modify the
training procedure to enable multi-step sampling. We identify and address the
training-inference input mismatch problem in this setting, by simulating
inference-time generator samples during training time. Taken together, our
improvements set new benchmarks in one-step image generation, with FID scores
of 1.28 on ImageNet-64x64 and 8.35 on zero-shot COCO 2014, surpassing the
original teacher despite a 500X reduction in inference cost. Further, we show
our approach can generate megapixel images by distilling SDXL, demonstrating
exceptional visual quality among few-step methods.
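The two time-scale update rule described above can be illustrated with a deliberately minimal 1-D sketch: the fake critic is updated several times per generator step so that it keeps tracking the moving distribution of generated samples, and the generator is then moved along the difference of real and fake scores. This is a toy stand-in, not the paper's implementation: both distributions are unit-variance Gaussians, the "networks" are scalar parameters, and the GAN loss and multi-step sampling are omitted. All names (`MU_REAL`, `CRITIC_STEPS`, `theta`, `mu_fake`) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D stand-in for the two time-scale update rule: the fake critic
# is refreshed several times per generator update so it accurately
# estimates the (moving) distribution of generated samples.

MU_REAL = 2.0          # "teacher" / real distribution: N(2, 1)
CRITIC_STEPS = 5       # critic updates per generator update (two time scales)

theta = -1.0           # generator: g(z) = theta + z, z ~ N(0, 1)
mu_fake = 0.0          # fake critic's current estimate of the generated mean

def real_score(x):
    # score of N(MU_REAL, 1): d/dx log p_real(x)
    return -(x - MU_REAL)

def fake_score(x):
    # score of the critic's model of the generated samples, N(mu_fake, 1)
    return -(x - mu_fake)

for step in range(2000):
    x = theta + rng.standard_normal(64)           # generated samples

    # critic: several updates per generator step (the two time-scale rule)
    for _ in range(CRITIC_STEPS):
        x_c = theta + rng.standard_normal(64)
        mu_fake += 0.1 * (x_c.mean() - mu_fake)   # move toward sample mean

    # generator: distribution-matching direction real_score - fake_score,
    # evaluated at the generated samples; descend its negative.
    grad = -(real_score(x) - fake_score(x)).mean()
    theta -= 0.05 * grad

# At convergence theta (and mu_fake) should sit near MU_REAL.
```

With too few critic steps the fake score lags the generator and the update direction becomes biased; raising `CRITIC_STEPS` is the toy analogue of the remedy the abstract proposes for the instability observed when the regression loss is removed.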