Fast Matrix Multiplications for Lookup Table-Quantized LLMs
July 15, 2024
Authors: Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim
cs.AI
Abstract
The deployment of large language models (LLMs) is often constrained by memory
bandwidth, where the primary bottleneck is the cost of transferring model
parameters from the GPU's global memory to its registers. When coupled with
custom kernels that fuse the dequantization and matmul operations, weight-only
quantization can thus enable faster inference by reducing the amount of memory
movement. However, developing high-performance kernels for weight-quantized
LLMs presents substantial challenges, especially when the weights are
compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform,
lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup
table engine for LUT-quantized LLMs, which uses offline restructuring of the
quantized weight matrix to minimize bit manipulations associated with
unpacking, and vectorization and duplication of the lookup table to mitigate
shared memory bandwidth constraints. At batch sizes < 32 and quantization group
size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster
than existing GEMM kernels. As an application of FLUTE, we explore a simple
extension to lookup table-based NormalFloat quantization and apply it to
quantize LLaMA3 to various configurations, obtaining competitive quantization
performance against strong baselines while obtaining an end-to-end throughput
increase of 1.5 to 2 times.
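To make the core operation concrete, the following is a minimal NumPy sketch of lookup-table (LUT) dequantization, the step that FLUTE fuses with the matmul on-GPU. All names, shapes, and the codebook here are illustrative assumptions, not the paper's actual kernel or data layout:

```python
import numpy as np

def lut_dequantize(indices, lut, scales, group_size=128):
    """Dequantize per-group LUT-coded weights (illustrative, not FLUTE's kernel).

    indices: int codes into `lut`, shape (out_features, in_features)
    lut:     codebook of 2**bits float values (e.g. NormalFloat-style levels)
    scales:  per-group scale factors, shape (out_features, in_features // group_size)
    """
    w = lut[indices]                          # codebook lookup
    out_f, in_f = w.shape
    w = w.reshape(out_f, in_f // group_size, group_size)
    w = w * scales[:, :, None]                # apply per-group scales
    return w.reshape(out_f, in_f)

# Toy example: 3-bit codes give an 8-entry codebook.
rng = np.random.default_rng(0)
lut = np.sort(rng.normal(size=8))             # stand-in for NF quantization levels
idx = rng.integers(0, 8, size=(4, 256))       # quantized weight codes
scales = np.ones((4, 256 // 128))             # trivial scales for the demo
W = lut_dequantize(idx, lut, scales)
y = W @ rng.normal(size=256)                  # dequantize-then-matmul
```

On a real GPU this table lookup hits shared memory on every weight element, which is why the paper vectorizes and duplicates the LUT, and why the awkward bit-unpacking for 3-bit codes is moved offline via weight restructuring.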