Clockwork Diffusion: Efficient Generation With Model-Step Distillation

Abstract

This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step, we find that not all operations are equally relevant for the final output quality. In particular, we observe that UNet layers operating on high-resolution feature maps are relatively sensitive to small perturbations. In contrast, low-resolution feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-resolution feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps, we save 32% of FLOPs with negligible change in FID and CLIP scores.
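The reuse idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `clockwork_schedule`, `ToyDenoiser`, and `run`, the fixed clock period, the scalar "feature maps", and the cost units are all hypothetical, and plain caching stands in for whatever approximation the method actually applies to the low-resolution features. The sketch only shows the control flow: the high-resolution path runs at every step, while the low-resolution path is recomputed on some steps and reused from a preceding step on the others.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


def clockwork_schedule(num_steps: int, clock: int) -> List[bool]:
    """Per denoising step: True = recompute the low-res path, False = reuse it.

    The fixed period `clock` is an assumption made for this sketch.
    """
    if clock < 1:
        raise ValueError("clock must be >= 1")
    return [step % clock == 0 for step in range(num_steps)]


@dataclass
class ToyDenoiser:
    """Hypothetical stand-in for a UNet split into high-res and low-res parts.

    Costs are arbitrary units, not measured FLOPs.
    """
    high_res_cost: int = 1
    low_res_cost: int = 2
    cached_low_res: Optional[float] = None
    cost_spent: int = 0

    def step(self, x: float, recompute_low_res: bool) -> float:
        # The high-res path is sensitive to perturbations, so it always runs.
        self.cost_spent += self.high_res_cost
        high = 0.9 * x

        # The low-res path is either recomputed or taken from an earlier step.
        if recompute_low_res or self.cached_low_res is None:
            self.cost_spent += self.low_res_cost
            self.cached_low_res = 0.5 * high

        # Combine high-res features with the (possibly stale) low-res features.
        return high - 0.1 * self.cached_low_res


def run(num_steps: int, clock: int) -> Tuple[float, int]:
    """Run the toy sampler and return (final value, total cost spent)."""
    model = ToyDenoiser()
    x = 1.0
    for recompute in clockwork_schedule(num_steps, clock):
        x = model.step(x, recompute)
    return x, model.cost_spent


if __name__ == "__main__":
    _, full_cost = run(num_steps=8, clock=1)       # low-res path at every step
    _, clockwork_cost = run(num_steps=8, clock=2)  # low-res path every other step
    print(full_cost, clockwork_cost)
```

With these made-up costs, the full run spends 24 units and the clocked run spends 16. These numbers only demonstrate the bookkeeping and are unrelated to the 32% FLOPs figure reported for Stable Diffusion v1.5.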
