SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
November 7, 2024
Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han
cs.AI
Abstract
Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, and conventional post-training quantization methods for large language models, such as smoothing, become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers via Singular Value Decomposition (SVD). This process eases quantization on both sides. However, naïvely running the low-rank branch independently incurs significant overhead due to the extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine, Nunchaku, that fuses the kernels of the low-rank branch into those of the low-bit branch to cut off redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without re-quantization. Extensive experiments on SDXL, PixArt-Sigma, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage of the 12B FLUX.1 models by 3.5×, achieving a 3.0× speedup over the 4-bit weight-only quantized baseline on a 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.
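To make the decomposition concrete, below is a minimal PyTorch sketch of the idea the abstract describes: activation outliers are first migrated into the weights with a smoothing-style per-channel scale, an SVD then peels off a high-precision low-rank branch that absorbs the dominant (outlier-heavy) components, and only the residual is quantized to 4 bits. This is an illustrative reconstruction under stated assumptions, not the released SVDQuant/Nunchaku implementation: the function names, the rank (32), the smoothing factor, and the simple symmetric per-tensor 4-bit quantizer are all assumptions made for brevity; the actual system uses finer-grained quantization and fused kernels.

```python
# Illustrative sketch only -- NOT the released SVDQuant/Nunchaku code.
# Assumptions: rank-32 branch, per-channel absmax smoothing factor, and a
# symmetric per-tensor 4-bit quantizer (the real system is finer-grained).
import torch

def quantize_4bit(t: torch.Tensor):
    """Symmetric per-tensor 4-bit quantization (assumed quantizer)."""
    scale = t.abs().max().clamp(min=1e-8) / 7.0   # map to the int4 range [-8, 7]
    q = torch.clamp(torch.round(t / scale), -8, 7)
    return q, scale

def svdquant_decompose(W: torch.Tensor, x_absmax: torch.Tensor, rank: int = 32):
    """Split W (out_features, in_features) into a 16-bit low-rank branch plus
    a 4-bit residual, after migrating activation outliers into the weights.
    x_absmax holds per-input-channel max magnitudes from calibration data."""
    # 1) Smoothing-style migration: X @ W.T == (X / lam) @ (W * lam).T, so
    #    scaling outlier channels into W makes the activations easy to quantize.
    lam = x_absmax.clamp(min=1e-5)
    W_hat = W * lam
    # 2) Absorb the now weight-side outliers with a high-precision low-rank
    #    branch via SVD (the dominant singular directions carry the outliers).
    U, S, Vh = torch.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]                   # (out, rank), kept in 16-bit
    L2 = Vh[:rank, :]                             # (rank, in),  kept in 16-bit
    # 3) Quantize only the residual, whose outlier mass has been absorbed.
    R_q, w_scale = quantize_4bit(W_hat - L1 @ L2)
    return L1, L2, R_q, w_scale, lam

def svdquant_forward(x, L1, L2, R_q, w_scale, lam):
    """Two-branch forward pass. Nunchaku fuses both branches into one kernel
    to avoid re-reading the activations; they are written separately here."""
    x_hat = x / lam                               # undo the migration on the activation side
    xq, x_scale = quantize_4bit(x_hat)            # 4-bit activations (per-tensor for brevity)
    low_rank = (x_hat @ L2.T) @ L1.T              # 16-bit low-rank branch
    low_bit = (xq @ R_q.T) * (x_scale * w_scale)  # 4-bit residual branch
    return low_rank + low_bit

# Quick numerical check of the approximation on random data:
W, x = torch.randn(512, 512), torch.randn(4, 512)
y = svdquant_forward(x, *svdquant_decompose(W, x.abs().amax(dim=0)))
print((y - x @ W.T).norm() / (x @ W.T).norm())    # small relative error
```

This two-branch structure also suggests why the abstract can claim LoRA support without re-quantization: since the engine already maintains a high-precision low-rank path, an off-the-shelf adapter update of the form ΔW = BA can plausibly ride along that branch while the 4-bit residual stays fixed, though the exact mechanism in Nunchaku is not spelled out in the abstract.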