SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
November 7, 2024
Authors: Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, Song Han
cs.AI
Abstract
Diffusion models have been proven highly effective at generating high-quality
images. However, as these models grow larger, they require significantly more
memory and suffer from higher latency, posing substantial challenges for
deployment. In this work, we aim to accelerate diffusion models by quantizing
their weights and activations to 4 bits. At such an aggressive level, both
weights and activations are highly sensitive, where conventional post-training
quantization methods for large language models like smoothing become
insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit
quantization paradigm. Different from smoothing which redistributes outliers
between weights and activations, our approach absorbs these outliers using a
low-rank branch. We first consolidate the outliers by shifting them from
activations to weights, then employ a high-precision low-rank branch to take in
the weight outliers with Singular Value Decomposition (SVD). This process eases
the quantization on both sides. However, naïvely running the low-rank
branch independently incurs significant overhead due to extra data movement of
activations, negating the quantization speedup. To address this, we co-design
an inference engine Nunchaku that fuses the kernels of the low-rank branch into
those of the low-bit branch to cut off redundant memory access. It can also
seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for
re-quantization. Extensive experiments on SDXL, PixArt-Sigma, and FLUX.1
validate the effectiveness of SVDQuant in preserving image quality. We reduce
the memory usage for the 12B FLUX.1 models by 3.5×, achieving
3.0× speedup over the 4-bit weight-only quantized baseline on the 16GB
laptop 4090 GPU, paving the way for more interactive applications on PCs. Our
quantization library and inference engine are open-sourced.
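
To make the decomposition concrete, below is a minimal NumPy sketch of the idea stated in the abstract: activation outliers are first migrated into the weights via a per-channel smoothing scale, a low-rank SVD branch then absorbs the resulting weight outliers at high precision, and only the residual weights and the smoothed activations are quantized to 4 bits. The function names, shapes, rank, and the per-tensor symmetric int4 quantizer are illustrative assumptions and do not correspond to the released SVDQuant library or the Nunchaku engine.

```python
# Minimal sketch of the SVDQuant idea, not the authors' implementation.
import numpy as np

def quantize_int4(x):
    """Symmetric 4-bit quantization with a single per-tensor scale (illustrative)."""
    scale = np.abs(x).max() / 7.0 + 1e-8
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def dequantize(q, scale):
    return q * scale

def svdquant_linear(X, W, rank=32):
    """Approximate Y = X @ W with a high-precision low-rank branch plus a 4-bit residual.

    1. Smooth: shift activation outliers into the weights via a per-channel scale.
    2. SVD: peel off the top-`rank` components of the scaled weights at high precision.
    3. Quantize: the residual weights and the smoothed activations go to 4 bits.
    """
    # Step 1: smoothing -- migrate outliers from activations to weights.
    lam = np.abs(X).max(axis=0) + 1e-8        # per-input-channel activation magnitude
    X_hat = X / lam                           # smoothed activations, easier to quantize
    W_hat = W * lam[:, None]                  # weights absorb the outliers; X_hat @ W_hat == X @ W

    # Step 2: low-rank branch absorbs the (now weight-side) outliers.
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]               # kept in high precision
    L2 = Vt[:rank, :]
    R = W_hat - L1 @ L2                       # residual with outliers removed

    # Step 3: 4-bit path on the residual and the smoothed activations.
    qX, sX = quantize_int4(X_hat)
    qR, sR = quantize_int4(R)
    low_bit = dequantize(qX, sX) @ dequantize(qR, sR)
    low_rank = X_hat @ L1 @ L2                # high-precision branch
    return low_rank + low_bit

# Quick check against the full-precision product.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))
W = rng.standard_normal((256, 128))
err = np.linalg.norm(svdquant_linear(X, W) - X @ W) / np.linalg.norm(X @ W)
print(f"relative error: {err:.3f}")
```

In the sketch the two branches are computed separately; as the abstract notes, doing this naïvely adds extra activation data movement, which is why the real system fuses the low-rank kernels into the low-bit kernels via the Nunchaku engine.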