GPTVQ: The Blessing of Dimensionality for LLM Quantization

Abstract

In this work, we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated and further compressed using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llamav2-70B model, depending on the quantization setting. Lastly, using on-device timings for VQ decompression on a mobile CPU, we show that VQ leads to improved latency compared to using a 4-bit integer format.
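To make the idea of quantization dimensionality concrete, the following is a minimal sketch of generic vector quantization, not the GPTVQ algorithm itself. It assumes plain k-means (hard-assignment EM) in place of the data-aware EM initialization described above, and it omits the Hessian-based weight updates and the codebook compression steps. The function names `kmeans_codebook` and `vector_quantize`, the random test matrix, and the choice of 2 dimensions with 4-bit indices are all illustrative assumptions.

```python
import numpy as np


def kmeans_codebook(points, num_centroids, num_iters=20, seed=0):
    """Fit a codebook with plain k-means (hard-assignment EM)."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from distinct randomly chosen points.
    init = rng.choice(len(points), size=num_centroids, replace=False)
    codebook = points[init].copy()
    for _ in range(num_iters):
        # E-step: assign each point to its nearest centroid.
        dists = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points.
        for k in range(num_centroids):
            members = points[assign == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    # Final assignment against the updated codebook.
    dists = ((points[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, dists.argmin(axis=1)


def vector_quantize(weights, dim=2, index_bits=4):
    """Quantize weights in groups of `dim` values that share one index."""
    points = weights.reshape(-1, dim)
    codebook, assign = kmeans_codebook(points, 2 ** index_bits)
    dequantized = codebook[assign].reshape(weights.shape)
    # Index cost only; the storage cost of the codebook is not counted.
    bits_per_weight = index_bits / dim
    return dequantized, codebook, bits_per_weight


# Hypothetical stand-in for a layer's weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
W_hat, codebook, bpw = vector_quantize(W, dim=2, index_bits=4)
mse = float(((W - W_hat) ** 2).mean())
print(f"codebook shape: {codebook.shape}, index bits per weight: {bpw}")
print(f"reconstruction MSE: {mse:.4f}")
```

In this sketch, each pair of weights is stored as a single 4-bit index into a 16-entry codebook, so the index cost is 2 bits per weight. Raising `dim` while keeping the same bits per weight allows a larger codebook whose centroids can follow the joint distribution of the weights, which is the kind of trade-off the abstract refers to.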
