

PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

October 7, 2024
Authors: Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo
cs.AI

Abstract

Quantization is essential for deploying Large Language Models (LLMs) by enhancing memory efficiency and inference speed. Existing methods for activation quantization mainly address channel-wise outliers, often neglecting token-wise outliers, leading to reliance on costly per-token dynamic quantization. To address this, we introduce PrefixQuant, a novel technique that isolates outlier tokens offline without re-training. Specifically, PrefixQuant identifies high-frequency outlier tokens and prefixes them in the KV cache, preventing the generation of outlier tokens during inference and simplifying quantization. To our knowledge, PrefixQuant is the first to enable efficient per-tensor static quantization to outperform expensive per-token dynamic quantization. For instance, in W4A4KV4 (4-bit weight, 4-bit activation, and 4-bit KV cache) Llama-3-8B, PrefixQuant with per-tensor static quantization achieves a 7.43 WikiText2 perplexity and 71.08% average accuracy on 5 common-sense reasoning tasks, outperforming previous per-token dynamic quantization methods such as QuaRot by 0.98 in perplexity and +5.98 points in accuracy. Additionally, the inference speed of W4A4 quantized models using PrefixQuant is 1.60x to 2.81x faster than FP16 models and exceeds QuaRot models by 1.2x to 1.3x. Our code is available at https://github.com/ChenMnZ/PrefixQuant.
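
To make the abstract's contrast concrete, here is a minimal, self-contained PyTorch sketch (not the authors' implementation; the tensor shapes, token counts, and calibration step are illustrative assumptions) comparing per-token dynamic quantization with per-tensor static quantization, and showing why isolating rare token-wise outliers offline, as PrefixQuant does by prefixing them in the KV cache, lets a single static scale work well.

```python
import torch

def quantize_per_token_dynamic(x: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Baseline: compute one scale per token (row) at inference time (costly)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

def quantize_per_tensor_static(x: torch.Tensor, scale: float, n_bits: int = 4) -> torch.Tensor:
    """One scale for the whole tensor, fixed offline; cheap but sensitive to outlier tokens."""
    qmax = 2 ** (n_bits - 1) - 1
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

torch.manual_seed(0)
acts = torch.randn(8, 16)        # 8 well-behaved tokens x 16 channels (toy activations)
acts_outlier = acts.clone()
acts_outlier[0] *= 100.0         # one token-wise outlier (e.g. an initial/delimiter token)

# Scale calibrated offline on outlier-free activations, as if the outlier token
# had already been moved into the prefixed KV cache and excluded from quantization.
calib_scale = (acts.abs().max() / (2 ** 3 - 1)).item()

err_static_clean   = (quantize_per_tensor_static(acts, calib_scale) - acts).abs().mean().item()
err_static_outlier = (quantize_per_tensor_static(acts_outlier, calib_scale) - acts_outlier).abs().mean().item()
err_dynamic        = (quantize_per_token_dynamic(acts) - acts).abs().mean().item()

print(f"per-tensor static, outliers prefixed away : {err_static_clean:.4f}")
print(f"per-tensor static, outlier token present  : {err_static_outlier:.4f}")
print(f"per-token dynamic (costly baseline)       : {err_dynamic:.4f}")
```

In this toy setting the static scale is competitive with the dynamic baseline once the outlier token is excluded, but its error explodes if the outlier stays inside the quantized range, which mirrors the trade-off the paper targets.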
