

Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

October 29, 2023
作者: Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
cs.AI

Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73× compared to FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.
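To make the "fine-grained quantization" idea concrete, the sketch below shows group-wise symmetric 4-bit quantization of a matrix, where each contiguous group of columns gets its own FP16 scale so that outliers in one group do not degrade the precision of others. This is a minimal illustration of the general technique, not Atom's actual implementation; the group size of 128, the symmetric rounding scheme, and the function names are assumptions made for the example.

```python
# Illustrative sketch of fine-grained (group-wise) symmetric INT4 quantization.
# NOTE: this is NOT Atom's implementation; group_size=128 and the symmetric
# scheme are assumptions for demonstration purposes only.
import numpy as np

def quantize_int4_groupwise(x: np.ndarray, group_size: int = 128):
    """Quantize each row of `x` in contiguous groups of `group_size` columns
    to signed 4-bit integers in [-8, 7], with one FP16 scale per group."""
    rows, cols = x.shape
    assert cols % group_size == 0, "columns must be divisible by group_size"
    g = x.reshape(rows, cols // group_size, group_size)
    # Per-group symmetric scale: map the group's max absolute value to 7.
    scales = np.abs(g).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(g / scales), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scales.astype(np.float16)

def dequantize_int4_groupwise(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    """Recover an approximate FP32 matrix from INT4 values and per-group scales."""
    rows, cols = q.shape
    g = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (g * scales.astype(np.float32)).reshape(rows, cols)

if __name__ == "__main__":
    w = np.random.randn(4, 256).astype(np.float32)
    q, s = quantize_int4_groupwise(w)
    w_hat = dequantize_int4_groupwise(q, s)
    print("mean abs quantization error:", np.abs(w - w_hat).mean())
```

Per-group scales keep the quantization error local: a large activation outlier only inflates the scale of its own group, which is one reason fine-grained schemes retain accuracy at 4 bits where per-tensor quantization does not.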