GPTVQ: The Blessing of Dimensionality for LLM Quantization
February 23, 2024
Authors: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough
cs.AI
Abstract
In this work we show that the size versus accuracy trade-off of neural
network quantization can be significantly improved by increasing the
quantization dimensionality. We propose the GPTVQ method, a new fast method for
post-training vector quantization (VQ) that scales well to Large Language
Models (LLMs). Our method interleaves quantization of one or more columns with
updates to the remaining unquantized weights, using information from the
Hessian of the per-layer output reconstruction MSE. Quantization codebooks are
initialized using an efficient data-aware version of the EM algorithm. The
codebooks are then updated, and further compressed by using integer
quantization and SVD-based compression. GPTVQ establishes a new state of the
art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2
and Mistral. Furthermore, our method is efficient: on a single H100 it takes
between 3 and 11 hours to process a Llama-v2-70B model, depending on the
quantization setting. Lastly, with on-device timings for VQ decompression on a
mobile CPU we show that VQ leads to improved latency compared to using a 4-bit
integer format.
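
The abstract compresses two ideas worth unpacking: (a) vector quantization, where each group of d weights is stored as a small index into a learned codebook of d-dimensional centroids, and (b) a GPTQ-style interleaved loop that folds each column's quantization error into the still-unquantized columns via the inverse Hessian. The sketch below illustrates (a) under simplifying assumptions: plain k-means stands in for the paper's Hessian-weighted, data-aware EM initialization, and the helper names (`fit_codebook`, `vq_quantize`, `vq_dequantize`) are illustrative, not from the paper.

```python
# Minimal sketch of d-dimensional vector quantization, assuming NumPy.
# Plain k-means stands in for the paper's data-aware EM initialization;
# all function names here are illustrative, not the authors' API.
import numpy as np

def fit_codebook(vectors, k, iters=20, seed=0):
    """Fit a k-entry codebook to d-dimensional vectors with Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared Euclidean distance).
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for j in range(k):
            members = vectors[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def vq_quantize(W, dim=2, bits_per_weight=2):
    """Group W into dim-sized vectors and map each to a codebook index."""
    vecs = W.reshape(-1, dim)
    k = 2 ** (bits_per_weight * dim)  # index bits are shared across dim weights
    codebook = fit_codebook(vecs, k)
    d2 = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, d2.argmin(axis=1)

def vq_dequantize(codebook, idx, shape):
    """Decompression is a single gather: codebook lookup, then reshape."""
    return codebook[idx].reshape(shape)

W = np.random.default_rng(1).standard_normal((128, 128)).astype(np.float32)
codebook, idx = vq_quantize(W, dim=2, bits_per_weight=2)  # 16-entry codebook
W_hat = vq_dequantize(codebook, idx, W.shape)
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```

Note that decompression is just a table lookup (`codebook[idx]`), which is why the abstract can report on-device latency competitive with a 4-bit integer format. The second sketch shows the interleaved column loop in its plain GPTQ form; the actual method quantizes one or more columns per step with the VQ step above rather than the scalar `quant_fn` placeholder used here, and `Hinv` is assumed to be the inverse of the per-layer Hessian of the output-reconstruction MSE.

```python
# Sketch of the interleaved quantize-and-update loop (GPTQ-style error
# compensation), assuming `Hinv` is the inverse Hessian of the layer's
# output-reconstruction MSE. `quant_fn` is a placeholder for the VQ step.
def interleaved_quantize(W, Hinv, quant_fn):
    W = W.copy()
    Q = np.zeros_like(W)
    for q in range(W.shape[1]):
        Q[:, q] = quant_fn(W[:, q])           # quantize column q
        err = (W[:, q] - Q[:, q]) / Hinv[q, q]
        # Fold the quantization error into the remaining unquantized columns.
        W[:, q + 1:] -= np.outer(err, Hinv[q, q + 1:])
    return Q
```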