On Architectural Compression of Text-to-Image Diffusion Models

Abstract

The exceptional text-to-image (T2I) generation results of Stable Diffusion models (SDMs) come with substantial computational demands. To address this, recent research on efficient SDMs has prioritized reducing the number of sampling steps and applying network quantization. Orthogonal to these directions, this study highlights the power of classical architectural compression for general-purpose T2I synthesis by introducing block-removed knowledge-distilled SDMs (BK-SDMs). We eliminate several residual and attention blocks from the U-Net of SDMs, reducing the number of parameters, the MACs per sampling step, and the latency by more than 30% each. We conduct distillation-based pretraining with only 0.22M LAION pairs (fewer than 0.1% of the full training pairs) on a single A100 GPU. Despite being trained with limited resources, our compact models can imitate the original SDM by benefiting from transferred knowledge, and they achieve competitive results against larger multi-billion-parameter models on the zero-shot MS-COCO benchmark. Moreover, we demonstrate the applicability of our lightweight pretrained models to personalized generation with DreamBooth finetuning.
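
The two ideas named in the abstract, removing blocks from a network and training the smaller network to imitate the original, can be illustrated with a toy sketch. This is only an illustration under stated assumptions: the abstract does not specify the loss, so the code uses a generic output-level distillation objective (a task term plus an imitation term), and every name here (`make_residual_block`, `remove_blocks`, `distillation_loss`, `kd_weight`) is hypothetical. The "blocks" are scalar stand-ins, not real residual or attention blocks.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_residual_block(scale: float) -> Block:
    """Toy residual block computing x + scale * x (stand-in for a U-Net block)."""
    def block(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return block


def run(blocks: Sequence[Block], x: Vector) -> Vector:
    """Apply the blocks in order, like a forward pass through a stack."""
    for block in blocks:
        x = block(x)
    return x


def remove_blocks(blocks: Sequence[Block], keep: Sequence[int]) -> List[Block]:
    """Build the compact (student) stack by keeping only the listed blocks."""
    return [blocks[i] for i in keep]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def distillation_loss(student_out: Vector, teacher_out: Vector,
                      target: Vector, kd_weight: float = 1.0) -> float:
    """Assumed generic objective: task loss plus weighted imitation of the teacher."""
    task = mse(student_out, target)
    imitation = mse(student_out, teacher_out)
    return task + kd_weight * imitation


# Teacher with four blocks; the student keeps only two of them.
teacher = [make_residual_block(s) for s in (0.1, 0.2, 0.3, 0.4)]
student = remove_blocks(teacher, keep=[0, 3])

x = [1.0, -2.0, 0.5]          # toy input
target = [0.0, 0.0, 0.0]      # toy training target
loss = distillation_loss(run(student, x), run(teacher, x), target)
```

In an actual training setup the student's weights would be updated by gradient descent to reduce such a loss; the sketch only evaluates it once to show how the teacher's output enters the objective.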
