GPTVQ: The Blessing of Dimensionality for LLM Quantization
February 23, 2024
Authors: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough
cs.AI
Abstract
In this work we show that the size versus accuracy trade-off of neural
network quantization can be significantly improved by increasing the
quantization dimensionality. We propose the GPTVQ method, a new fast method for
post-training vector quantization (VQ) that scales well to Large Language
Models (LLMs). Our method interleaves quantization of one or more columns with
updates to the remaining unquantized weights, using information from the
Hessian of the per-layer output reconstruction MSE. Quantization codebooks are
initialized using an efficient data-aware version of the EM algorithm. The
codebooks are then updated, and further compressed by using integer
quantization and SVD-based compression. GPTVQ establishes a new state of the
art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2
and Mistral. Furthermore, our method is efficient: on a single H100 it takes
between 3 and 11 hours to process a Llama-v2-70B model, depending on the
quantization setting. Lastly, with on-device timings for VQ decompression on a
mobile CPU we show that VQ leads to improved latency compared to using a 4-bit
integer format.
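
The abstract compresses two ideas worth unpacking: (a) vector quantization, where each group of d weights is stored as a small index into a learned codebook of d-dimensional centroids, and (b) a GPTQ-style interleaved loop that folds each column's quantization error into the still-unquantized columns via the inverse Hessian. The sketch below illustrates (a) under simplifying assumptions: plain k-means stands in for the paper's Hessian-weighted, data-aware EM initialization, and the helper names (`fit_codebook`, `vq_quantize`, `vq_dequantize`) are illustrative, not from the paper.

```python
# Minimal sketch of d-dimensional vector quantization, assuming NumPy.
# Plain k-means stands in for the paper's data-aware EM initialization;
# all function names here are illustrative, not the authors' API.
import numpy as np

def fit_codebook(vectors, k, iters=20, seed=0):
    """Fit a k-entry codebook to d-dimensional vectors with Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared Euclidean distance).
        d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for j in range(k):
            members = vectors[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids

def vq_quantize(W, dim=2, bits_per_weight=2):
    """Group W into dim-sized vectors and map each to a codebook index."""
    vecs = W.reshape(-1, dim)
    k = 2 ** (bits_per_weight * dim)  # index bits are shared across dim weights
    codebook = fit_codebook(vecs, k)
    d2 = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook, d2.argmin(axis=1)

def vq_dequantize(codebook, idx, shape):
    """Decompression is a single gather: codebook lookup, then reshape."""
    return codebook[idx].reshape(shape)

W = np.random.default_rng(1).standard_normal((128, 128)).astype(np.float32)
codebook, idx = vq_quantize(W, dim=2, bits_per_weight=2)  # 16-entry codebook
W_hat = vq_dequantize(codebook, idx, W.shape)
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```

Note that decompression is just a table lookup (`codebook[idx]`), which is why the abstract can report on-device latency competitive with a 4-bit integer format. The second sketch shows the interleaved column loop in its plain GPTQ form; the actual method quantizes one or more columns per step with the VQ step above rather than the scalar `quant_fn` placeholder used here, and `Hinv` is assumed to be the inverse of the per-layer Hessian of the output-reconstruction MSE.

```python
# Sketch of the interleaved quantize-and-update loop (GPTQ-style error
# compensation), assuming `Hinv` is the inverse Hessian of the layer's
# output-reconstruction MSE. `quant_fn` is a placeholder for the VQ step.
def interleaved_quantize(W, Hinv, quant_fn):
    W = W.copy()
    Q = np.zeros_like(W)
    for q in range(W.shape[1]):
        Q[:, q] = quant_fn(W[:, q])           # quantize column q
        err = (W[:, q] - Q[:, q]) / Hinv[q, q]
        # Fold the quantization error into the remaining unquantized columns.
        W[:, q + 1:] -= np.outer(err, Hinv[q, q + 1:])
    return Q
```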