VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Abstract

Scaling model size poses a significant challenge for the deployment and inference of Large Language Models (LLMs). Because LLM weights are redundant, recent research has focused on pushing weight-only quantization to extremely low bit widths (down to 2 bits). This reduces memory requirements, lowers storage costs, and decreases memory bandwidth needs during inference. However, due to the limits of numerical representation, traditional scalar-based weight quantization struggles at such extremely low bit widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a simple and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that, at 2-bit, VPTQ reduces model quantization perplexity over SOTA by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3, with an average accuracy improvement on QA tasks of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on LLaMA-3. VPTQ also needs only 10.4-18.6% of the quantization algorithm execution time and delivers a 1.6-1.8× increase in inference throughput compared to SOTA.
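
The idea of compressing vectors into indices with a lookup table can be illustrated with a minimal sketch. This is plain k-means vector quantization, not the VPTQ algorithm itself: it uses no second-order optimization, no residual quantization, and no outlier handling. All function names below are hypothetical and chosen for illustration only.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(vector, codebook):
    """Index of the codebook entry (centroid) closest to `vector`."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def kmeans_codebook(vectors, k, iterations=20, seed=0):
    """Build a k-entry codebook with plain Lloyd's k-means (an assumption
    made for this sketch; VPTQ describes its own initialization)."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        assignments = [nearest_index(v, codebook) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                codebook[c] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def quantize(weights, vector_length, k):
    """Split a flat weight list into vectors and replace each by an index.
    Assumes len(weights) is divisible by vector_length."""
    vectors = [weights[i:i + vector_length]
               for i in range(0, len(weights), vector_length)]
    codebook = kmeans_codebook(vectors, k)
    indices = [nearest_index(v, codebook) for v in vectors]
    return indices, codebook


def dequantize(indices, codebook):
    """Reconstruct weights by looking each index up in the codebook."""
    return [x for i in indices for x in codebook[i]]


def index_bits_per_weight(vector_length, k):
    """Index cost per weight; ignores the storage cost of the codebook."""
    return math.log2(k) / vector_length


# Toy usage: 512 random "weights", vectors of length 2, 16 centroids.
rng = random.Random(1)
weights = [rng.gauss(0.0, 1.0) for _ in range(512)]
indices, codebook = quantize(weights, vector_length=2, k=16)
restored = dequantize(indices, codebook)
print(index_bits_per_weight(2, 16))  # 2.0
```

With vectors of length 2 and 16 centroids, each index costs log2(16) = 4 bits and covers 2 weights, which gives 2 index bits per weight. In a toy example this small the codebook itself would dominate storage; the overhead only becomes negligible when the same codebook is shared across a large weight matrix.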
