GPTVQ: The Blessing of Dimensionality for LLM Quantization

February 23, 2024
作者: Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough
cs.AI

Abstract

In this work we show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose the GPTVQ method, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed, using integer quantization and SVD-based compression. GPTVQ establishes a new state of the art in the size versus accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on the quantization setting. Lastly, using on-device timings for VQ decompression on a mobile CPU, we show that VQ leads to improved latency compared to using a 4-bit integer format.
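The central mechanism, mapping small groups of weights to entries of a shared codebook, can be sketched compactly. Below is a minimal, illustrative NumPy sketch that fits a codebook with plain k-means (a hard-assignment form of EM) and quantizes a weight matrix in 2-D groups. It deliberately omits the Hessian-weighted objective, the data-aware initialization, and the interleaved column updates that GPTVQ itself uses, and every name in it is hypothetical.

```python
import numpy as np

def vq_quantize(W, dim=2, k=256, iters=10, seed=0):
    """Toy vector quantization: map groups of `dim` consecutive weights
    to the nearest of `k` codebook centroids fitted with plain k-means.
    Illustrative only; GPTVQ additionally weights the objective with
    per-layer Hessian information and interleaves column updates."""
    rng = np.random.default_rng(seed)
    vecs = W.reshape(-1, dim)                      # dim-dimensional points
    codebook = vecs[rng.choice(len(vecs), k, replace=False)]
    for _ in range(iters):
        # E-step: assign each vector to its nearest centroid.
        d2 = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        # M-step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = vecs[assign == c]
            if len(members):
                codebook[c] = members.mean(0)
    return codebook, assign

W = np.random.randn(128, 128).astype(np.float32)
codebook, idx = vq_quantize(W)
W_hat = codebook[idx].reshape(W.shape)             # decompression: table lookup
print(f"reconstruction MSE: {((W - W_hat) ** 2).mean():.6f}")
```

At these (assumed) settings each 8-bit index covers two weights, i.e. about 4 bits per weight before codebook overhead, which is the regime in which the abstract compares VQ decompression latency against a plain 4-bit integer format.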