Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization
August 31, 2024
Authors: Vage Egiazarian, Denis Kuznedelev, Anton Voronov, Ruslan Svirschevski, Michael Goin, Daniil Pavlov, Dan Alistarh, Dmitry Baranchuk
cs.AI
Abstract
Text-to-image diffusion models have emerged as a powerful framework for
high-quality image generation given textual prompts. Their success has driven
the rapid development of production-grade diffusion models that consistently
increase in size and already contain billions of parameters. As a result,
state-of-the-art text-to-image models are becoming less accessible in practice,
especially in resource-limited environments. Post-training quantization (PTQ)
tackles this issue by compressing the pretrained model weights into lower-bit
representations. Recent diffusion quantization techniques primarily rely on
uniform scalar quantization, providing decent performance for the models
compressed to 4 bits. This work demonstrates that more versatile vector
quantization (VQ) may achieve higher compression rates for large-scale
text-to-image diffusion models. Specifically, we tailor vector-based PTQ
methods to recent billion-scale text-to-image models (SDXL and SDXL-Turbo), and
show that diffusion models of 2B+ parameters compressed to around 3 bits
using VQ exhibit image quality and textual alignment similar to those of
previous 4-bit compression techniques.
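To make the scalar-vs-vector distinction concrete, here is a minimal NumPy sketch (not the paper's actual method): uniform scalar quantization rounds every weight independently to an evenly spaced grid, while vector quantization splits the weights into short groups and replaces each group with the nearest entry of a learned codebook. The group length `d`, codebook size `K`, and the simple k-means fitting below are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(32768).astype(np.float32)  # toy "weight tensor"

def scalar_quantize(x, bits=2):
    """Uniform scalar quantization: round each weight to a 2**bits-level grid."""
    levels = 2 ** bits
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (levels - 1)
    return (lo + np.round((x - lo) / step) * step).astype(x.dtype)

def vector_quantize(x, d=4, K=256, iters=10):
    """Vector quantization: split weights into groups of d values and map each
    group to its nearest codeword; the codebook is fit with k-means steps."""
    groups = x.reshape(-1, d)
    codebook = groups[rng.choice(len(groups), K, replace=False)].copy()
    for _ in range(iters):
        # squared distance from every group to every codeword
        dists = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)            # nearest codeword per group
        for k in range(K):                  # move codewords to cluster means
            members = groups[assign == k]
            if len(members):
                codebook[k] = members.mean(0)
    return codebook[assign].reshape(x.shape)

# Matched bit budget: 2-bit scalar grid vs. log2(256)/4 = 2 index bits/weight.
sq = scalar_quantize(w, bits=2)
vq = vector_quantize(w, d=4, K=256)

sq_mse = float(((w - sq) ** 2).mean())
vq_mse = float(((w - vq) ** 2).mean())
print(f"scalar 2-bit MSE: {sq_mse:.4f}   VQ 2-bit MSE: {vq_mse:.4f}")
```

At the same bit budget, the codebook can adapt to the weight distribution and exploit correlations within a group, so the VQ reconstruction error here comes out lower than the uniform grid's. Production-grade PTQ for SDXL-scale models involves far more (calibration data, per-layer codebooks, fine-tuning); this sketch only shows why vector codes can beat per-weight grids.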