Fast Matrix Multiplications for Lookup Table-Quantized LLMs
July 15, 2024
Authors: Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim
cs.AI
Abstract
The deployment of large language models (LLMs) is often constrained by memory
bandwidth, where the primary bottleneck is the cost of transferring model
parameters from the GPU's global memory to its registers. When coupled with
custom kernels that fuse the dequantization and matmul operations, weight-only
quantization can thus enable faster inference by reducing the amount of memory
movement. However, developing high-performance kernels for weight-quantized
LLMs presents substantial challenges, especially when the weights are
compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform,
lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup
table engine for LUT-quantized LLMs, which uses offline restructuring of the
quantized weight matrix to minimize bit manipulations associated with
unpacking, and vectorization and duplication of the lookup table to mitigate
shared memory bandwidth constraints. At batch sizes < 32 and quantization group
size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster
than existing GEMM kernels. As an application of FLUTE, we explore a simple
extension to lookup table-based NormalFloat quantization and apply it to
quantize LLaMA3 to various configurations, obtaining competitive quantization
performance against strong baselines while obtaining an end-to-end throughput
increase of 1.5 to 2 times.
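To make the core idea concrete, here is a minimal NumPy sketch of lookup-table dequantization fused with a matmul: quantized weights are stored as small integer indices into a non-uniform codebook (as in NormalFloat), looked up, rescaled per group, and multiplied with the activations. This is only an illustrative reference computation, not FLUTE's actual GPU kernel; the function name, shapes, and the random codebook are assumptions for the example, and real 3-bit storage would additionally pack indices into a dense bit stream.

```python
import numpy as np

def lut_dequant_matmul(x, idx, lut, scales, group_size=128):
    """Reference LUT dequantization + matmul (illustrative, not FLUTE's kernel).

    x:      activations, shape (m, k)
    idx:    codebook indices for each weight, shape (k, n)
    lut:    non-uniform codebook with 2**bits entries, shape (2**bits,)
    scales: per-group scales along the k dimension, shape (k // group_size, n)
    """
    # "Dequantize" by looking up each index in the codebook
    w = lut[idx]                                      # (k, n)
    # Apply each group's scale to its block of group_size rows
    w = w * np.repeat(scales, group_size, axis=0)     # (k, n)
    return x @ w                                      # (m, n)

rng = np.random.default_rng(0)
bits, group_size = 3, 128
m, k, n = 4, 256, 64

# Hypothetical NormalFloat-style codebook: 2**3 = 8 non-uniform levels
lut = np.sort(rng.standard_normal(2**bits)).astype(np.float32)
idx = rng.integers(0, 2**bits, size=(k, n))           # stored 3-bit indices
scales = rng.random((k // group_size, n)).astype(np.float32)
x = rng.standard_normal((m, k)).astype(np.float32)

y = lut_dequant_matmul(x, idx, lut, scales, group_size)
```

FLUTE's contribution is making this pattern fast on GPUs: the index matrix is restructured offline so unpacking the 3-bit stream needs fewer bit manipulations, and the small codebook is vectorized and duplicated across shared-memory banks so concurrent lookups do not serialize.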