Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

Abstract

Text-to-image diffusion models have emerged as a powerful framework for generating high-quality images from textual prompts. Their success has driven the rapid development of production-grade diffusion models that keep growing in size and already contain billions of parameters. As a result, state-of-the-art text-to-image models are becoming less accessible in practice, especially in resource-limited environments. Post-training quantization (PTQ) tackles this issue by compressing the pretrained model weights into lower-bit representations. Recent diffusion quantization techniques rely primarily on uniform scalar quantization, which provides decent performance for models compressed to 4 bits. This work demonstrates that the more versatile vector quantization (VQ) may achieve higher compression rates for large-scale text-to-image diffusion models. Specifically, we tailor vector-based PTQ methods to recent billion-scale text-to-image models (SDXL and SDXL-Turbo) and show that diffusion models with 2B+ parameters compressed to around 3 bits using VQ exhibit image quality and textual alignment similar to those of previous 4-bit compression techniques.
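
To make the idea of vector quantization concrete, the following is a minimal sketch and not the method used in this work: it groups the weights of a random matrix into short vectors, learns a shared codebook with plain k-means, and stores only one codebook index per group. The function names (`kmeans_codebook`, `vq_compress`, `vq_decompress`), the group size of 2, and the 6-bit index are illustrative assumptions chosen so that the index cost works out to 3 bits per weight; codebook storage is ignored in that figure.

```python
import numpy as np

def kmeans_codebook(vectors, num_codes, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative). Returns (codebook, codes)."""
    rng = np.random.default_rng(seed)
    # Initialise the codebook from randomly chosen data vectors.
    init = rng.choice(len(vectors), size=num_codes, replace=False)
    codebook = vectors[init].copy()
    for _ in range(iters):
        # Squared distance from every vector to every code.
        dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        codes = dists.argmin(axis=1)
        for k in range(num_codes):
            members = vectors[codes == k]
            if len(members):  # keep the old code if its cluster is empty
                codebook[k] = members.mean(axis=0)
    # Final assignment against the updated codebook.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dists.argmin(axis=1)

def vq_compress(weight, group_size, index_bits):
    """Split the weights into groups and replace each group by a code index."""
    vectors = weight.reshape(-1, group_size)
    return kmeans_codebook(vectors, 2 ** index_bits)

def vq_decompress(codebook, codes, shape):
    """Look up each index in the codebook and restore the original shape."""
    return codebook[codes].reshape(shape)

# Hypothetical stand-in for one layer's weight matrix.
W = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)

group_size, index_bits = 2, 6  # assumed values: 6 bits per 2 weights
codebook, codes = vq_compress(W, group_size, index_bits)
W_hat = vq_decompress(codebook, codes, W.shape)

bits_per_weight = index_bits / group_size  # ignores codebook storage
mse = float(np.mean((W - W_hat) ** 2))
print(f"bits per weight: {bits_per_weight}, reconstruction MSE: {mse:.4f}")
```

Uniform scalar quantization corresponds to the special case where each weight is quantized on its own against a fixed, evenly spaced grid. Quantizing groups of weights jointly against a learned codebook is what lets VQ adapt to the weight distribution at a given bit budget.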
