SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

Abstract

Diffusion models have proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive, and conventional post-training quantization methods for large language models, such as smoothing, become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Unlike smoothing, which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first consolidate the outliers by shifting them from activations to weights, then employ a high-precision low-rank branch to take in the weight outliers with Singular Value Decomposition (SVD). This process eases quantization on both sides. However, naively running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we co-design an inference engine, Nunchaku, that fuses the kernels of the low-rank branch into those of the low-bit branch to cut redundant memory access. It can also seamlessly support off-the-shelf low-rank adapters (LoRAs) without the need for re-quantization. Extensive experiments on SDXL, PixArt-Sigma, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage of the 12B FLUX.1 models by 3.5×, achieving a 3.0× speedup over the 4-bit weight-only quantized baseline on a 16GB laptop 4090 GPU, paving the way for more interactive applications on PCs. Our quantization library and inference engine are open-sourced.
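
The pipeline described in the abstract (shift outliers from activations to weights, absorb the weight outliers with a high-precision low-rank branch obtained by SVD, then quantize what remains to 4 bits) can be illustrated with a small NumPy sketch. This is not the authors' implementation or the Nunchaku engine. The function names, the SmoothQuant-style scaling formula with exponent `alpha`, the rank, the group size, and the simulated ("fake") symmetric 4-bit quantizer are all assumptions chosen for illustration, and the two branches run as separate matrix products rather than as a fused kernel.

```python
import numpy as np


def quantize_int4(t, group_size=64):
    """Simulated symmetric 4-bit quantization (assumed scheme, not the paper's).

    One scale is used per group of `group_size` consecutive values along the
    last axis. Returns the dequantized tensor and the integer codes.
    """
    groups = t.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    codes = np.clip(np.round(groups / scale), -7, 7)
    return (codes * scale).reshape(t.shape), codes.reshape(t.shape)


def svdquant_sketch(X, W, rank=16, alpha=0.5, eps=1e-8):
    """Hypothetical helper: approximate X @ W with a low-rank + 4-bit split."""
    # Step 1: shift activation outliers into the weights with a
    # per-input-channel scale (SmoothQuant-style formula, an assumption here).
    s = (np.abs(X).max(axis=0) + eps) ** alpha / (np.abs(W).max(axis=1) + eps) ** (1 - alpha)
    X_hat = X / s            # activations with tamed outliers
    W_hat = W * s[:, None]   # weights now carry the outliers

    # Step 2: absorb the dominant part of W_hat into a rank-`rank` branch.
    U, S, Vt = np.linalg.svd(W_hat, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]   # shape (d_in, rank)
    L2 = Vt[:rank]                # shape (rank, d_out)
    R = W_hat - L1 @ L2           # residual left for 4-bit quantization

    # Step 3: high-precision low-rank branch plus simulated 4-bit branch.
    Xq, _ = quantize_int4(X_hat)
    Rq, _ = quantize_int4(R)
    Y = X_hat @ L1 @ L2 + Xq @ Rq
    return Y, (X_hat, W_hat, L1, L2, R)


# Toy data: a few activation channels are scaled up to act as outliers.
rng = np.random.default_rng(0)
X = rng.normal(size=(128, 256))
X[:, :4] *= 30.0
W = rng.normal(size=(256, 128))

Y_ref = X @ W
Y_svdq, _ = svdquant_sketch(X, W)
Y_naive = quantize_int4(X)[0] @ quantize_int4(W)[0]


def rel_err(Y):
    return np.linalg.norm(Y - Y_ref) / np.linalg.norm(Y_ref)


print("low-rank + 4-bit relative error:", rel_err(Y_svdq))
print("naive 4-bit relative error:     ", rel_err(Y_naive))
```

Two identities hold by construction: the scaling step leaves the product unchanged (`X_hat @ W_hat` equals `X @ W` up to rounding), and `L1 @ L2 + R` reproduces `W_hat` exactly, so all approximation error comes from quantizing `X_hat` and `R`. How much the low-rank branch helps depends on the data and the chosen rank; the printed errors are only a toy comparison.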
