
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

October 29, 2023
作者: Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
cs.AI

Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization setups in the serving context. Atom improves end-to-end throughput by up to 7.73× compared to FP16 and by 2.53× compared to INT8 quantization, while maintaining the same latency target.
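For illustration, the sketch below shows the kind of fine-grained (per-group) 4-bit quantization the abstract refers to: each group of values shares one scale, so quantization error is contained within small groups. The group size, symmetric rounding scheme, and function names are assumptions chosen for exposition, not Atom's actual kernels or calibration procedure.

```python
import numpy as np

def quantize_group_int4(x: np.ndarray, group_size: int = 128):
    """Symmetric per-group INT4 quantization of a 1-D tensor.

    Illustrative sketch only: group size and rounding scheme are
    assumptions, not Atom's implementation.
    """
    assert x.size % group_size == 0
    groups = x.reshape(-1, group_size)
    # One scale per group: map the max magnitude onto the INT4 range [-7, 7].
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid divide-by-zero
    q = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize_group_int4(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Recover an approximate float tensor from INT4 codes and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(-1)

# Usage: quantize a synthetic weight vector and check the reconstruction error.
w = np.random.randn(1024).astype(np.float32)
q, s = quantize_group_int4(w, group_size=128)
w_hat = dequantize_group_int4(q, s)
print("mean abs error:", np.abs(w - w_hat).mean())
```

Smaller groups give lower quantization error at the cost of storing more scales; the mixed-precision part of the method (keeping a small set of values in higher precision) is omitted here for brevity.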