VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models
September 25, 2024
Authors: Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang
cs.AI
Abstract
Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to the extremely low-bit regime (even down to 2 bits). This reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to reach such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
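To make the lookup-table idea concrete, here is a minimal sketch of generic vector quantization of a weight matrix, not the VPTQ algorithm itself: each row is split into length-v vectors, every vector is replaced by the index of its nearest codebook entry, and dequantization is a plain table lookup. The names vq_compress and vq_decompress are illustrative, not taken from the paper or any released code.

```python
import numpy as np

def vq_compress(weight: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each length-v slice of `weight` to the index of its nearest codebook entry.

    weight:   (rows, cols) matrix, with cols divisible by the vector length v.
    codebook: (k, v) lookup table of centroid vectors.
    Returns:  (rows, cols // v) matrix of indices into the codebook.
    """
    k, v = codebook.shape
    vectors = weight.reshape(-1, v)
    # Squared Euclidean distance from every vector to every centroid.
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices.reshape(weight.shape[0], -1)

def vq_decompress(indices: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate weight matrix by lookup-table dequantization."""
    return codebook[indices].reshape(indices.shape[0], -1)
```

With a codebook of k entries and vector length v, storage costs about log2(k)/v bits per weight; for example, k = 4096 and v = 8 gives 1.5 bits per weight, which is how VQ reaches bit-widths that scalar number formats cannot represent.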
In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on LLaMA-3 on QA tasks. We use only 10.4-18.6% of the quantization algorithm execution time, resulting in a 1.6-1.8× increase in inference throughput compared to SOTA.
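As a rough illustration of the residual quantization mentioned above, the sketch below shows generic two-stage residual VQ: a second codebook quantizes the error left by the first, so the two lookups sum to a finer reconstruction at the cost of one extra index per vector. This shows the general technique only, not the specific VPTQ formulation; the function names (nearest_indices, residual_vq, residual_dequant) are hypothetical.

```python
import numpy as np

def nearest_indices(vectors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Index of the closest codebook entry for each row of `vectors`."""
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def residual_vq(weight: np.ndarray, codebook0: np.ndarray, codebook1: np.ndarray):
    """Two-stage residual VQ: quantize length-v vectors with the first codebook,
    then quantize the remaining error with a second codebook."""
    v = codebook0.shape[1]
    vectors = weight.reshape(-1, v)
    idx0 = nearest_indices(vectors, codebook0)    # first-stage assignment
    residual = vectors - codebook0[idx0]          # error the first stage leaves behind
    idx1 = nearest_indices(residual, codebook1)   # second-stage assignment
    return idx0, idx1

def residual_dequant(idx0: np.ndarray, idx1: np.ndarray,
                     codebook0: np.ndarray, codebook1: np.ndarray,
                     shape: tuple) -> np.ndarray:
    """Reconstruct weights by summing the two lookup-table entries per vector."""
    return (codebook0[idx0] + codebook1[idx1]).reshape(shape)
```

Adding the second stage costs an extra log2(k1)/v bits per weight in exchange for a smaller reconstruction error.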