On Architectural Compression of Text-to-Image Diffusion Models

Abstract

The exceptional text-to-image (T2I) generation results of Stable Diffusion models (SDMs) come with substantial computational demands. To address this, recent research on efficient SDMs has prioritized reducing the number of sampling steps and applying network quantization. Orthogonal to these directions, this study highlights the power of classical architectural compression for general-purpose T2I synthesis by introducing block-removed knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention blocks from the U-Net of SDMs, reducing the number of parameters, the MACs per sampling step, and the latency by more than 30%. We conduct distillation-based pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training pairs) on a single A100 GPU. Despite being trained with limited resources, our compact models can imitate the original SDM by benefiting from transferred knowledge, and they achieve competitive results against larger multi-billion-parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate the applicability of our lightweight pretrained models to personalized generation with DreamBooth finetuning.
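
The two ideas named in the abstract, block removal followed by distillation from the original model, can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: the scalar "blocks", the choice of which blocks to keep, the learning rate, and the output-only mean-squared-error objective are all assumptions made for illustration, since the abstract does not specify the loss or the removed blocks.

```python
# Toy illustration only: scalar "blocks" stand in for U-Net residual/attention
# blocks. All names and numbers are hypothetical, not taken from the paper.

def forward(scales, x):
    """Apply a chain of toy residual blocks: x -> x + s * x for each scale s."""
    for s in scales:
        x = x + s * x
    return x

def mse(student_scales, teacher_scales, inputs):
    """Mean squared error between student and teacher outputs."""
    errs = [(forward(student_scales, x) - forward(teacher_scales, x)) ** 2
            for x in inputs]
    return sum(errs) / len(errs)

# Teacher: four blocks. Student: keep blocks 0 and 2 (assumed choice) and
# initialize them from the teacher's weights.
teacher = [0.1, 0.2, 0.1, 0.2]
kept = [0, 2]
student = [teacher[i] for i in kept]
reduction = 1 - len(student) / len(teacher)  # fraction of blocks removed

inputs = [-1.0, 0.5, 2.0]
initial_loss = mse(student, teacher, inputs)

# Distillation: plain gradient descent so the student matches the teacher output.
lr = 0.05
for _ in range(200):
    grads = [0.0] * len(student)
    for x in inputs:
        out = forward(student, x)
        diff = out - forward(teacher, x)
        for j, s in enumerate(student):
            # For this toy chain, d(out)/d(s_j) = out / (1 + s_j).
            grads[j] += 2 * diff * out / (1 + s) / len(inputs)
    student = [s - lr * g for s, g in zip(student, grads)]

final_loss = mse(student, teacher, inputs)
print(f"blocks removed: {reduction:.0%}")
print(f"loss before distillation: {initial_loss:.4f}, after: {final_loss:.2e}")
```

In this toy setting the smaller student starts with a visible output gap to the teacher and closes it through distillation alone, which mirrors, in spirit only, the abstract's claim that a compact model can imitate the original SDM using transferred knowledge.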
