Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To use GPU resources efficiently and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLM serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73× compared to FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.
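The abstract names two ingredients, fine-grained (group-wise) quantization and mixed precision, without giving details. The toy sketch below illustrates the general idea only and is not Atom's actual algorithm or kernel. It assumes symmetric per-group scaling, INT4 for ordinary groups, and INT8 for any group whose largest magnitude exceeds a hypothetical `outlier_threshold`. All function names and the sample values are made up for illustration.

```python
from typing import List, Optional, Tuple

Group = Tuple[List[int], float]  # (integer codes, scale) for one group


def quantize_group(values: List[float], bits: int) -> Group:
    """Symmetric quantization of one group to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4, 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_row(row: List[float], group_size: int,
                 outlier_threshold: Optional[float] = None) -> List[Group]:
    """Quantize a row group by group, each group with its own scale.

    Assumption for this sketch: a group containing a value larger than
    `outlier_threshold` is kept at 8 bits; every other group uses 4 bits.
    """
    groups = []
    for start in range(0, len(row), group_size):
        chunk = row[start:start + group_size]
        is_outlier = (outlier_threshold is not None
                      and max(abs(v) for v in chunk) > outlier_threshold)
        groups.append(quantize_group(chunk, bits=8 if is_outlier else 4))
    return groups


def dequantize_row(groups: List[Group]) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [code * scale for codes, scale in groups for code in codes]


def max_error(row: List[float], groups: List[Group]) -> float:
    """Largest absolute reconstruction error over the row."""
    return max(abs(a - b) for a, b in zip(row, dequantize_row(groups)))


# Hypothetical activations with a single large outlier (8.0).
row = [0.1, -0.2, 0.3, 0.05, 8.0, -0.4, 0.2, 0.1]

# One INT4 scale for the whole row: the outlier dictates the scale.
coarse = quantize_row(row, group_size=len(row))
# Groups of 4, with the outlier group kept at INT8.
mixed = quantize_row(row, group_size=4, outlier_threshold=1.0)

print("coarse INT4 max error:", max_error(row, coarse))
print("grouped mixed-precision max error:", max_error(row, mixed))
```

In this toy case, the single-scale INT4 version rounds every small value to zero because the outlier sets the scale. Per-group scales with a higher-precision outlier group keep the reconstruction error much smaller, and most values are still stored in 4 bits.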
