PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs
October 7, 2024
Authors: Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo
cs.AI
Abstract
Quantization is essential for deploying Large Language Models (LLMs) by
enhancing memory efficiency and inference speed. Existing methods for
activation quantization mainly address channel-wise outliers, often neglecting
token-wise outliers, leading to reliance on costly per-token dynamic
quantization. To address this, we introduce PrefixQuant, a novel technique that
isolates outlier tokens offline without re-training. Specifically, PrefixQuant
identifies high-frequency outlier tokens and prefixes them in the KV cache,
preventing the generation of outlier tokens during inference and simplifying
quantization. To our knowledge, PrefixQuant is the first to enable efficient
per-tensor static quantization to outperform expensive per-token dynamic
quantization. For instance, in W4A4KV4 (4-bit weight, 4-bit activation, and
4-bit KV cache) Llama-3-8B, PrefixQuant with per-tensor static quantization
achieves a 7.43 WikiText2 perplexity and 71.08% average accuracy on 5
common-sense reasoning tasks, outperforming previous per-token dynamic
quantization methods such as QuaRot by 0.98 in perplexity and 5.98 percentage
points in accuracy. Additionally, the inference speed of W4A4 quantized models
using PrefixQuant is 1.60x to 2.81x faster than FP16 models and exceeds QuaRot
models by 1.2x to 1.3x. Our code is available at
https://github.com/ChenMnZ/PrefixQuant.
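The abstract's central contrast is between per-tensor static quantization (one scale for the whole activation tensor, calibrated offline) and per-token dynamic quantization (a fresh scale computed per token at runtime). The sketch below, a minimal NumPy illustration with function names of our own choosing (not from the PrefixQuant codebase), shows why an outlier token makes a single shared scale costly, which is the problem PrefixQuant's prefixed outliers are designed to remove:

```python
import numpy as np

def quantize_per_tensor_static(x, scale, bits=4):
    """Static per-tensor quantization: one scale, calibrated offline."""
    qmax = 2 ** (bits - 1) - 1  # 7 for symmetric int4
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale  # dequantize so we can measure the error

def quantize_per_token_dynamic(x, bits=4):
    """Dynamic per-token quantization: one scale per token, computed at runtime."""
    qmax = 2 ** (bits - 1) - 1
    scales = np.abs(x).max(axis=-1, keepdims=True) / qmax  # per-token range
    q = np.clip(np.round(x / scales), -qmax - 1, qmax)
    return q * scales

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)).astype(np.float32)  # 4 tokens, hidden dim 8
x[0] *= 50.0  # one outlier token inflates the whole tensor's range

static_scale = np.abs(x).max() / 7  # calibration over the full tensor
err_static = np.abs(x - quantize_per_tensor_static(x, static_scale)).mean()
err_dynamic = np.abs(x - quantize_per_token_dynamic(x)).mean()
print(err_static > err_dynamic)  # True: the outlier token hurts the shared scale
```

With the outlier token present, the shared static scale is dominated by that token, so the remaining tokens round mostly to zero; per-token scales sidestep this at the cost of runtime scale computation. PrefixQuant's claim is that once outlier tokens are isolated offline in the KV cache, the cheap static scheme no longer pays this penalty.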