Clockwork Diffusion: Efficient Generation With Model-Step Distillation

Abstract

This work aims to improve the efficiency of text-to-image diffusion models. Diffusion models run a computationally expensive UNet-based denoising operation at every generation step, but we find that not all of these operations are equally relevant to the final output quality. In particular, we observe that UNet layers operating on high-resolution feature maps are relatively sensitive to small perturbations. In contrast, low-resolution feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation, we propose Clockwork Diffusion, a method that periodically reuses computation from preceding denoising steps to approximate low-resolution feature maps at one or more subsequent steps. For multiple baselines, and for both text-to-image generation and image editing, we demonstrate that Clockwork achieves comparable or improved perceptual scores at drastically reduced computational complexity. As an example, for Stable Diffusion v1.5 with 8 DPM++ steps, we save 32% of FLOPs with negligible change in FID and CLIP scores.
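The idea of periodically reusing low-resolution computation can be illustrated with a toy sketch. This is not the paper's implementation: `ToyDenoiser`, `sample`, the `clock` parameter, and the cost units are all hypothetical names and values chosen for illustration. The sketch also reuses cached low-resolution features unchanged, which is a simplifying assumption about how the approximation might work.

```python
from typing import List, Optional


class ToyDenoiser:
    """Toy stand-in for a UNet: a high-res path wrapped around a low-res core.

    The cost values are made-up units for illustration, not measured FLOPs.
    """

    def __init__(self, high_res_cost: int = 1, low_res_cost: int = 2) -> None:
        self.high_res_cost = high_res_cost
        self.low_res_cost = low_res_cost
        self.cost_spent = 0
        self.cached_low: Optional[List[float]] = None

    def low_res_core(self, x: List[float]) -> List[float]:
        # Average neighbouring pairs to mimic a downsampled feature map.
        self.cost_spent += self.low_res_cost
        return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

    def step(self, x: List[float], reuse_low_res: bool) -> List[float]:
        # The high-res path runs at every step.
        self.cost_spent += self.high_res_cost
        if reuse_low_res and self.cached_low is not None:
            low = self.cached_low          # reuse features from an earlier step
        else:
            low = self.low_res_core(x)     # full pass: recompute and cache
            self.cached_low = low
        # Upsample by repetition and mix with the high-res input.
        up = [v for v in low for _ in range(2)]
        return [0.5 * xi + 0.25 * ui for xi, ui in zip(x, up)]


def sample(x: List[float], num_steps: int, clock: int,
           model: ToyDenoiser) -> List[float]:
    """Run num_steps denoising steps with a full pass every `clock` steps."""
    for t in range(num_steps):
        full_pass = (t % clock == 0)
        x = model.step(x, reuse_low_res=not full_pass)
    return x
```

With these made-up costs, calling `sample` for 8 steps with `clock=1` spends 24 cost units, while `clock=2` spends 16, because the low-res core runs on only 4 of the 8 steps. These figures follow only from the toy cost values and are unrelated to the 32% FLOP saving reported above.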
