Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

Abstract

Text-to-image diffusion models have emerged as a powerful framework for high-quality image generation from textual prompts. Their success has driven the rapid development of production-grade diffusion models that steadily grow in size and already contain billions of parameters. As a result, state-of-the-art text-to-image models are becoming less accessible in practice, especially in resource-limited environments. Post-training quantization (PTQ) tackles this issue by compressing the pretrained model weights into lower-bit representations. Recent diffusion quantization techniques rely primarily on uniform scalar quantization, which provides decent performance for models compressed to 4 bits. This work demonstrates that the more versatile vector quantization (VQ) may achieve higher compression rates for large-scale text-to-image diffusion models. Specifically, we tailor vector-based PTQ methods to recent billion-scale text-to-image models (SDXL and SDXL-Turbo) and show that diffusion models with 2B+ parameters, compressed to around 3 bits using VQ, exhibit image quality and textual alignment similar to those of previous 4-bit compression techniques.
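To make the distinction between uniform scalar quantization and vector quantization concrete, the following is a minimal sketch on a toy weight matrix. It is not the method used in this work: the VQ part is plain k-means on groups of consecutive weights, and every name here (`uniform_scalar_quantize`, `vector_quantize`, `dim`, `codebook_bits`) is a hypothetical choice made for illustration.

```python
import numpy as np


def uniform_scalar_quantize(w, bits):
    """Round each weight independently to one of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / (levels - 1)
    codes = np.round((w - lo) / scale)  # integer-valued codes in [0, levels - 1]
    return codes * scale + lo


def _assign(groups, codebook):
    """Index of the nearest codeword (squared Euclidean distance) for each group."""
    dists = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def vector_quantize(w, dim, codebook_bits, iters=20, seed=0):
    """Replace each group of `dim` consecutive weights with its nearest codeword.

    Illustrative assumption: the codebook is fit with plain k-means, and
    w.size must be divisible by `dim`.
    """
    rng = np.random.default_rng(seed)
    groups = w.reshape(-1, dim)
    k = 2 ** codebook_bits
    # Initialise the codebook from k distinct groups of the data.
    codebook = groups[rng.choice(len(groups), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = _assign(groups, codebook)
        for j in range(k):
            members = groups[idx == j]
            if len(members) > 0:  # leave empty clusters unchanged
                codebook[j] = members.mean(axis=0)
    idx = _assign(groups, codebook)  # final assignment for the fitted codebook
    return codebook[idx].reshape(w.shape), idx, codebook


def index_bits_per_weight(dim, codebook_bits):
    """Index cost per weight, ignoring the storage of the codebook itself."""
    return codebook_bits / dim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 64))  # toy stand-in for a layer's weights
    w_sq = uniform_scalar_quantize(w, bits=4)
    w_vq, idx, codebook = vector_quantize(w, dim=2, codebook_bits=6)
    print("scalar 4-bit MSE:", np.mean((w - w_sq) ** 2))
    print("VQ index bits per weight:", index_bits_per_weight(2, 6))
    print("VQ MSE:", np.mean((w - w_vq) ** 2))
```

With `dim=2` and `codebook_bits=6`, each pair of weights is stored as one 6-bit index, so the index cost is 3 bits per weight. The codebook itself adds storage that this sketch ignores; on a layer with many weights that overhead is spread over all of them.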
