FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Abstract

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) while consistently preserving model quality across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. Supporting FP6 quantization on GPUs is challenging because of (1) hardware-unfriendly memory access to model weights with irregular bit-widths and (2) the high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for floating-point weights across various quantization bit-widths. We integrate the TC-FPx kernel into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference that achieves better trade-offs between inference cost and model quality. Experiments show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. The source code will be publicly available soon.
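To make the two challenges concrete, the following is a minimal CPU-side sketch, not the TC-FPx kernel. The abstract does not state the FP6 bit layout, so the sketch assumes an E3M2 layout (1 sign, 3 exponent, 2 mantissa bits, exponent bias 3, no Inf/NaN encodings). The function names `pack_fp6`, `unpack_fp6`, and `decode_fp6_e3m2` are hypothetical. It shows that four 6-bit weights occupy three bytes, so individual weights do not sit on byte boundaries, and that each weight needs bit manipulation before it can be used as an ordinary floating-point value.

```python
# Illustrative sketch only; layout and names are assumptions, not the paper's code.

def pack_fp6(codes):
    """Pack 6-bit codes (ints in 0..63) densely: 4 codes per 3 bytes."""
    if len(codes) % 4 != 0:
        raise ValueError("number of codes must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        a, b, c, d = codes[i:i + 4]
        if any(not 0 <= x <= 63 for x in (a, b, c, d)):
            raise ValueError("each code must fit in 6 bits")
        word = (a << 18) | (b << 12) | (c << 6) | d  # 24 bits total
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_fp6(packed):
    """Inverse of pack_fp6: recover the 6-bit codes from packed bytes."""
    codes = []
    for i in range(0, len(packed), 3):
        word = int.from_bytes(packed[i:i + 3], "big")
        # Codes straddle byte boundaries, so shifts and masks are required.
        codes.extend((word >> shift) & 0x3F for shift in (18, 12, 6, 0))
    return codes


def decode_fp6_e3m2(code, bias=3):
    """Decode one 6-bit code to a Python float (assumed E3M2, bias 3)."""
    sign = -1.0 if (code >> 5) & 1 else 1.0
    exp = (code >> 2) & 0x7
    man = code & 0x3
    if exp == 0:  # subnormal: no implicit leading 1
        return sign * (man / 4.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + man / 4.0) * 2.0 ** (exp - bias)


codes = [12, 31, 44, 1]
packed = pack_fp6(codes)
values = [decode_fp6_e3m2(c) for c in unpack_fp6(packed)]
```

Under these assumptions, the four weights above occupy 3 bytes instead of the 8 bytes FP16 would need. The per-weight shift, mask, and decode work is the kind of runtime cost the abstract refers to as de-quantization overhead.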
