

Clockwork Diffusion: Efficient Generation With Model-Step Distillation

December 13, 2023
作者: Amirhossein Habibian, Amir Ghodrati, Noor Fathima, Guillaume Sautiere, Risheek Garrepalli, Fatih Porikli, Jens Petersen
cs.AI

Abstract

This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we identify that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast, low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change.
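The reuse scheme described in the abstract can be pictured with a short sketch: split the UNet into high-resolution outer layers and a low-resolution core, run the full network only on periodic "clock" steps, and reuse the cached low-res features on the steps in between. The snippet below is a minimal illustration of that idea, not the authors' implementation; `ToyUNet`, `clockwork_sampling`, and the `clock` period are hypothetical stand-ins, timestep and text conditioning are omitted, and the cached features are reused directly, whereas the paper uses the cached computation to approximate the low-res feature maps.

```python
# Minimal, illustrative sketch of the Clockwork reuse idea (not the paper's code).
# ToyUNet / clockwork_sampling / clock are hypothetical stand-ins; timestep and
# text conditioning are omitted for brevity.
import torch
import torch.nn as nn


class ToyUNet(nn.Module):
    """Toy UNet split into cheap high-res outer layers and an expensive low-res core."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # high-res -> low-res
        self.core = nn.Sequential(                                          # expensive low-res core
            nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.up = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)  # low-res -> high-res

    def forward(self, x, cached_core=None):
        h = self.down(x)
        # Run the expensive low-res core, or reuse features cached at an earlier step.
        core_out = self.core(h) if cached_core is None else cached_core
        return self.up(core_out) + x, core_out


@torch.no_grad()
def clockwork_sampling(model, x, num_steps=8, clock=2):
    """Toy denoising loop: recompute the low-res core only every `clock` steps."""
    cached = None
    for step in range(num_steps):
        full_pass = step % clock == 0                      # "clock" tick: full computation
        x, core_out = model(x, cached_core=None if full_pass else cached)
        if full_pass:
            cached = core_out                              # cache low-res features for reuse
    return x


if __name__ == "__main__":
    net = ToyUNet()
    latents = torch.randn(1, 8, 32, 32)
    print(clockwork_sampling(net, latents, num_steps=8, clock=2).shape)  # torch.Size([1, 8, 32, 32])
```

With `clock=2`, the low-res core runs on only half of the denoising steps; in a real UNet this is where the FLOP savings reported in the abstract would come from, since the high-res layers, which are more sensitive to perturbation, are still computed at every step.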