Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Abstract

The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs. FLUTE restructures the quantized weight matrix offline to minimize the bit manipulations associated with unpacking, and it vectorizes and duplicates the lookup table to mitigate shared memory bandwidth constraints. At batch sizes below 32 and a quantization group size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster than existing GEMM kernels. As an application of FLUTE, we explore a simple extension to lookup table-based NormalFloat quantization and apply it to quantize LLaMA3 to various configurations, obtaining competitive quantization performance against strong baselines along with an end-to-end throughput increase of 1.5-2x.
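To make the terms above concrete, the following is a minimal pure-Python sketch of 3-bit LUT quantization with per-group scales. It is not the FLUTE kernel, and it does not reproduce the offline restructuring or the lookup table vectorization and duplication. Every function name, the bit layout, the table values, and the tiny group size are assumptions chosen for illustration. It only shows why 3-bit codes straddle byte boundaries and how a table lookup followed by a per-group scale recovers the weights before the dot product.

```python
# Illustrative sketch only: all names, the bit layout, and the table values are
# hypothetical and are not taken from the FLUTE implementation.

def pack_3bit(codes):
    """Pack integers in [0, 7] into bytes, 3 bits each, lowest bits first."""
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        bits |= (c & 0b111) << nbits
        nbits += 3
        while nbits >= 8:          # emit every completed byte
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:                      # flush a trailing partial byte
        out.append(bits & 0xFF)
    return bytes(out)

def unpack_3bit(packed, count):
    """Recover `count` 3-bit codes; codes may straddle byte boundaries."""
    bits = int.from_bytes(packed, "little")
    return [(bits >> (3 * i)) & 0b111 for i in range(count)]

def lut_dequantize(codes, lut, scales, group_size):
    """Map each code through the table, then apply its group's scale."""
    return [lut[c] * scales[i // group_size] for i, c in enumerate(codes)]

def lut_matvec(packed_rows, lut, scales_rows, x, group_size):
    """Dequantize each packed weight row and take its dot product with x."""
    y = []
    for packed, scales in zip(packed_rows, scales_rows):
        codes = unpack_3bit(packed, len(x))
        w = lut_dequantize(codes, lut, scales, group_size)
        y.append(sum(wi * xi for wi, xi in zip(w, x)))
    return y

# A non-uniform 8-entry table (hypothetical values) and one weight row.
LUT = [-1.0, -0.5, -0.25, -0.125, 0.125, 0.25, 0.5, 1.0]
codes = [0, 7, 3, 4, 1, 6, 2, 5]
packed = pack_3bit(codes)          # 8 codes * 3 bits = 24 bits = 3 bytes
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
# The group size is 4 here for brevity; the abstract's setting is 128.
y = lut_matvec([packed], LUT, [[2.0, 0.5]], x, group_size=4)
```

A fused GPU kernel performs the unpack, lookup, and scale steps on the fly inside the matrix multiplication rather than materializing the dequantized row as this sketch does.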
