PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
October 7, 2024
Authors: Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo
cs.AI
Abstract
Quantization is essential for deploying Large Language Models (LLMs) by
enhancing memory efficiency and inference speed. Existing methods for
activation quantization mainly address channel-wise outliers, often neglecting
token-wise outliers, leading to reliance on costly per-token dynamic
quantization. To address this, we introduce PrefixQuant, a novel technique that
isolates outlier tokens offline without re-training. Specifically, PrefixQuant
identifies high-frequency outlier tokens and prefixes them in the KV cache,
preventing the generation of outlier tokens during inference and simplifying
quantization. To our knowledge, PrefixQuant is the first to enable efficient
per-tensor static quantization to outperform expensive per-token dynamic
quantization. For instance, in W4A4KV4 (4-bit weight, 4-bit activation, and
4-bit KV cache) Llama-3-8B, PrefixQuant with per-tensor static quantization
achieves a 7.43 WikiText2 perplexity and 71.08% average accuracy on 5
common-sense reasoning tasks, outperforming previous per-token dynamic
quantization methods such as QuaRot by 0.98 in perplexity and 5.98 percentage
points in accuracy. Additionally, the inference speed of W4A4 quantized models
using PrefixQuant is 1.60x to 2.81x faster than FP16 models and exceeds QuaRot
models by 1.2x to 1.3x. Our code is available at
https://github.com/ChenMnZ/PrefixQuant.
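The abstract's central contrast is between per-tensor static quantization (one scale for the whole activation tensor, calibrated offline) and per-token dynamic quantization (a fresh scale computed per token at runtime). The sketch below, a minimal NumPy illustration with function names of our own choosing (not from the PrefixQuant codebase), shows why an outlier token makes a single shared scale costly, which is the problem PrefixQuant's prefixed outliers are designed to remove:

```python
import numpy as np

def quantize_per_tensor_static(x, scale, bits=4):
    """Static per-tensor quantization: one scale, calibrated offline."""
    qmax = 2 ** (bits - 1) - 1  # 7 for symmetric int4
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantize so we can measure the error

def quantize_per_token_dynamic(x, bits=4):
    """Dynamic per-token quantization: one scale per token, computed at runtime."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(x).max(axis=-1, keepdims=True) / qmax  # per-token range
    q = np.clip(np.round(x / scales), -qmax - 1, qmax)
    return q * scales

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)).astype(np.float32)  # 4 tokens, hidden dim 8
x[0] *= 50.0  # one outlier token inflates the whole tensor's range

static_scale = np.abs(x).max() / 7  # calibration over the full tensor
err_static = np.abs(x - quantize_per_tensor_static(x, static_scale)).mean()
err_dynamic = np.abs(x - quantize_per_token_dynamic(x)).mean()
print(err_static > err_dynamic)  # True: the outlier token hurts the shared scale
```

With the outlier token present, the shared static scale is dominated by that token, so the remaining tokens round mostly to zero; per-token scales sidestep this at the cost of runtime scale computation. PrefixQuant's claim is that once outlier tokens are isolated offline in the KV cache, the cheap static scheme no longer pays this penalty.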