Fast Matrix Multiplications for Lookup Table-Quantized LLMs

Abstract

The deployment of large language models (LLMs) is often constrained by memory bandwidth: the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can therefore enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs. FLUTE restructures the quantized weight matrix offline to minimize the bit manipulations associated with unpacking, and it vectorizes and duplicates the lookup table to mitigate shared memory bandwidth constraints. At batch sizes below 32 and a quantization group size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster than existing GEMM kernels. As an application of FLUTE, we explore a simple extension to lookup table-based NormalFloat quantization and apply it to quantize LLaMA3 to various configurations, achieving competitive quantization performance against strong baselines while increasing end-to-end throughput by 1.5 to 2 times.
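To make the operation concrete, the following is a minimal NumPy reference sketch of what a fused LUT dequantization and matmul computes. It is not the FLUTE kernel and reproduces none of its GPU optimizations. The function names (`pack_codes`, `unpack_codes`, `lut_dequantize`, `lut_matmul`), the dense little-endian bit layout, and the per-group scale arrangement are all assumptions made for illustration. With 3-bit codes, eight codes occupy three bytes, so individual codes straddle byte boundaries. That is the unpacking cost the abstract refers to.

```python
import numpy as np

GROUP_SIZE = 128  # quantization group size mentioned in the abstract
BITS = 3          # a bit width that does not evenly divide 8, 16, or 32


def pack_codes(codes: np.ndarray, bits: int = BITS) -> np.ndarray:
    """Pack integer codes (each < 2**bits) into a dense uint8 bitstream.

    Assumed layout: codes in row-major order, least significant bit first.
    """
    # Expand each code into its `bits` binary digits.
    digits = (codes.reshape(-1, 1) >> np.arange(bits)) & 1
    return np.packbits(digits.astype(np.uint8).ravel(), bitorder="little")


def unpack_codes(packed: np.ndarray, count: int, bits: int = BITS) -> np.ndarray:
    """Inverse of pack_codes: recover `count` codes from the bitstream."""
    digits = np.unpackbits(packed, count=count * bits, bitorder="little")
    place_values = 1 << np.arange(bits)
    return (digits.reshape(count, bits) * place_values).sum(axis=1)


def lut_dequantize(codes, table, scales, group_size=GROUP_SIZE):
    """Map codes to table entries, then rescale each group of rows.

    codes:  (K, N) integer indices into `table`
    table:  (2**bits,) non-uniform quantization levels
    scales: (K // group_size, N) one scale per group and output column
    """
    return table[codes] * np.repeat(scales, group_size, axis=0)


def lut_matmul(x, packed, shape, table, scales):
    """Reference y = x @ W, with W stored as packed LUT codes."""
    k, n = shape
    codes = unpack_codes(packed, k * n).reshape(k, n)
    return x @ lut_dequantize(codes, table, scales)
```

In this sketch the full weight matrix is materialized before the multiplication, which is the memory traffic a fused kernel is meant to avoid. A fused kernel would instead unpack and look up values on chip while the packed codes stream in from global memory.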
