On Architectural Compression of Text-to-Image Diffusion Models
May 25, 2023
Authors: Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi
cs.AI
Abstract
Exceptional text-to-image (T2I) generation results of Stable Diffusion models
(SDMs) come with substantial computational demands. To resolve this issue,
recent research on efficient SDMs has prioritized reducing the number of
sampling steps and utilizing network quantization. Orthogonal to these
directions, this study highlights the power of classical architectural
compression for general-purpose T2I synthesis by introducing block-removed
knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention
blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of
parameters, MACs per sampling step, and latency. We conduct distillation-based
pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training
pairs) on a single A100 GPU. Despite being trained with limited resources, our
compact models can imitate the original SDM by benefiting from transferred
knowledge and achieve competitive results against larger multi-billion
parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate
the applicability of our lightweight pretrained models in personalized
generation with DreamBooth finetuning.
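
To make the block-removal and distillation ideas above concrete, the following is a minimal PyTorch/diffusers-style sketch, not the paper's released code. The model identifier, the use of layers_per_block=1 as the compression knob, and the loss weighting are illustrative assumptions; BK-SDM's actual block-removal pattern and training objective may differ (the paper also removes specific residual and attention blocks rather than uniformly thinning every stage).

import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler, UNet2DConditionModel

# Illustrative sketch only (not the official BK-SDM code).
# Teacher: the original Stable Diffusion U-Net (frozen).
# Student: a thinner U-Net, built here by simply using fewer
# residual/attention layers per stage.
teacher = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)
teacher.requires_grad_(False)

student_config = dict(teacher.config)
student_config["layers_per_block"] = 1  # assumed compression knob, not the paper's exact removal pattern
student = UNet2DConditionModel.from_config(student_config)

scheduler = DDPMScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

def distillation_step(latents, text_emb, lambda_kd=1.0):
    """One training step: denoising loss plus output-level knowledge distillation."""
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device
    )
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)

    # Student predicts the added noise (standard diffusion objective).
    student_pred = student(noisy_latents, timesteps, text_emb).sample
    task_loss = F.mse_loss(student_pred, noise)

    # Distillation: match the frozen teacher's noise prediction.
    with torch.no_grad():
        teacher_pred = teacher(noisy_latents, timesteps, text_emb).sample
    kd_loss = F.mse_loss(student_pred, teacher_pred)

    loss = task_loss + lambda_kd * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice, latents would come from encoding LAION images with the SDM's VAE and text_emb from its frozen text encoder; the paper reports that roughly 0.22M such pairs on a single A100 suffice for this distillation-based pretraining.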