Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models

Abstract

One of the key components of diffusion models is the UNet used for noise prediction. While several works have explored basic properties of the UNet decoder, its encoder remains largely unexplored. In this work, we conduct the first comprehensive study of the UNet encoder. We empirically analyze the encoder features and provide insights into important questions about how they change during inference. In particular, we find that encoder features change gradually, whereas decoder features vary substantially across time-steps. This finding inspired us to omit the encoder at certain adjacent time-steps and cyclically reuse the encoder features from previous time-steps in the decoder. Building on this observation, we introduce a simple yet effective encoder propagation scheme that accelerates diffusion sampling for a diverse set of tasks. The propagation scheme also lets us run the decoder in parallel at certain adjacent time-steps. Additionally, we introduce a prior noise injection method to improve texture details in the generated image. Besides the standard text-to-image task, we validate our approach on other tasks: text-to-video, personalized generation, and reference-guided generation. Without using any knowledge distillation technique, our approach accelerates sampling for the Stable Diffusion (SD) and DeepFloyd-IF models by 41% and 24%, respectively, while maintaining high-quality generation performance. Our code is available at https://github.com/hutaiHang/Faster-Diffusion.
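The encoder propagation idea described in the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: `ToyUNet`, `sample`, the choice of key time-steps, and the update rule are all hypothetical stand-ins, chosen only to show the control flow of running the encoder at selected time-steps and reusing its cached features elsewhere.

```python
from dataclasses import dataclass


@dataclass
class ToyUNet:
    """Hypothetical stand-in for a noise-prediction UNet split into two halves."""
    encoder_calls: int = 0
    decoder_calls: int = 0

    def encode(self, x, t):
        # Toy "encoder features": a scaled copy of the current latent.
        self.encoder_calls += 1
        return [0.5 * v for v in x]

    def decode(self, feats, t):
        # Toy noise estimate from encoder features and the time-step only.
        # The current latent is not an input here, which is why decoder
        # calls that share cached features could in principle run in parallel.
        self.decoder_calls += 1
        return [f * (t / 10.0) for f in feats]


def sample(unet, x, timesteps, key_steps):
    """Run a toy sampling loop, calling the encoder only at key time-steps."""
    cached = None
    for t in timesteps:
        if cached is None or t in key_steps:
            cached = unet.encode(x, t)  # key time-step: refresh encoder features
        eps = unet.decode(cached, t)    # every step: decoder uses cached features
        # Assumed toy update rule, not a real diffusion scheduler step.
        x = [v - 0.1 * e for v, e in zip(x, eps)]
    return x


unet = ToyUNet()
timesteps = list(range(10, 0, -1))  # 10, 9, ..., 1
key_steps = {10, 7, 4, 1}           # hypothetical selection of key time-steps
result = sample(unet, [1.0, -2.0], timesteps, key_steps)
print(unet.encoder_calls, unet.decoder_calls)
```

With these assumed settings the encoder runs 4 times while the decoder runs at all 10 time-steps. How key time-steps are actually selected, and how much quality is retained, is determined by the paper's method rather than by this sketch.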
