On Architectural Compression of Text-to-Image Diffusion Models
May 25, 2023
Authors: Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi
cs.AI
Abstract
Exceptional text-to-image (T2I) generation results of Stable Diffusion models
(SDMs) come with substantial computational demands. To resolve this issue,
recent research on efficient SDMs has prioritized reducing the number of
sampling steps and utilizing network quantization. Orthogonal to these
directions, this study highlights the power of classical architectural
compression for general-purpose T2I synthesis by introducing block-removed
knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention
blocks from the U-Net of SDMs, obtaining over a 30% reduction in the number of
parameters, MACs per sampling step, and latency. We conduct distillation-based
pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training
pairs) on a single A100 GPU. Despite being trained with limited resources, our
compact models can imitate the original SDM by benefiting from transferred
knowledge and achieve competitive results against larger multi-billion-parameter
models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate
the applicability of our lightweight pretrained models in personalized
generation with DreamBooth finetuning.
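
The abstract describes distillation-based pretraining of a block-removed student U-Net against the original SDM. The following is a minimal sketch of one plausible form such a training objective could take, combining the standard noise-prediction loss with an output-level distillation term that pushes the compact student toward the frozen teacher's predictions; `student_unet`, `teacher_unet`, and all argument names are hypothetical placeholders for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F


def distillation_step(student_unet, teacher_unet, noisy_latents, timesteps,
                      text_emb, noise, alpha_out=1.0):
    """One illustrative training step for a block-removed student U-Net.

    Both U-Nets are assumed to map (noisy_latents, timesteps, text_emb) to a
    predicted noise tensor (epsilon-prediction). Only the student is updated;
    the original SDM's U-Net acts as a frozen teacher.
    """
    # Teacher runs without gradients: it only provides targets to imitate.
    with torch.no_grad():
        teacher_pred = teacher_unet(noisy_latents, timesteps, text_emb)

    student_pred = student_unet(noisy_latents, timesteps, text_emb)

    # Task loss: predict the noise that was added to the latents.
    loss_task = F.mse_loss(student_pred, noise)

    # Output-level knowledge distillation: match the teacher's prediction.
    loss_distill = F.mse_loss(student_pred, teacher_pred)

    return loss_task + alpha_out * loss_distill
```

In practice the loss above would be computed per batch of LAION image-text pairs (latents and text embeddings produced by the unchanged VAE and text encoder) and backpropagated through the student only; the weighting `alpha_out` is an assumed hyperparameter, not a value reported in the abstract.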