Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Abstract

The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs, which uses offline restructuring of the quantized weight matrix to minimize bit manipulations associated with unpacking, and vectorization and duplication of the lookup table to mitigate shared memory bandwidth constraints. At batch sizes below 32 and a quantization group size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster than existing GEMM kernels. As an application of FLUTE, we explore a simple extension to lookup table-based NormalFloat quantization and apply it to quantize LLaMA3 to various configurations, obtaining competitive quantization performance against strong baselines together with an end-to-end throughput increase of 1.5 to 2 times.
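To make the terms in the abstract concrete, the following pure-Python sketch shows what a fused dequantize-and-multiply has to compute for LUT-quantized weights: unpack the bit-packed codes, map each code through the lookup table, apply the per-group scale, and accumulate the dot product. It is a slow reference under stated assumptions, not FLUTE's GPU kernel. The function names, the little-endian bit layout, the 8-entry table, and the scales are hypothetical choices made for illustration; in particular, the table is not the NormalFloat table.

```python
# Illustrative reference only: pure Python, not a GPU kernel and not FLUTE's code.

# Hypothetical non-uniform 3-bit table (2**3 = 8 entries); NOT the NormalFloat values.
EXAMPLE_LUT = [-1.0, -0.5, -0.25, -0.125, 0.125, 0.25, 0.5, 1.0]


def pack_codes(codes, bits):
    """Pack integer codes (each < 2**bits) into a little-endian bit stream."""
    buf = 0      # pending bits not yet written out
    nbits = 0    # number of valid bits currently held in buf
    out = bytearray()
    for c in codes:
        buf |= c << nbits
        nbits += bits
        while nbits >= 8:          # flush every complete byte
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:                      # flush the final partial byte, if any
        out.append(buf & 0xFF)
    return bytes(out)


def unpack_codes(packed, bits, count):
    """Recover `count` codes. With bits=3, some codes straddle a byte boundary,
    which is the bit manipulation a real kernel wants to minimize."""
    stream = int.from_bytes(packed, "little")
    mask = (1 << bits) - 1
    return [(stream >> (i * bits)) & mask for i in range(count)]


def dequantize(codes, lut, scales, group_size):
    """Look each code up in the table and apply the scale of its group."""
    return [lut[c] * scales[i // group_size] for i, c in enumerate(codes)]


def lut_matvec(x, packed_rows, lut, scales_rows, bits, group_size):
    """Compute y[j] = sum_i x[i] * W[j][i], dequantizing each row of W on the fly
    so the full-precision weight matrix is never materialized."""
    y = []
    for packed, scales in zip(packed_rows, scales_rows):
        codes = unpack_codes(packed, bits, len(x))
        w = dequantize(codes, lut, scales, group_size)
        y.append(sum(xi * wi for xi, wi in zip(x, w)))
    return y
```

Eight 3-bit codes occupy exactly three bytes here, so code boundaries do not line up with byte boundaries; this is the sense in which 3 bits is a non-evenly-divisible width. The sketch would be called with `group_size=128` to match the setting quoted in the abstract, although any group size that divides the row length works.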
