LLM-FP4: 4-Bit Floating-Point Quantized Transformers
October 25, 2023
Authors: Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
cs.AI
Abstract
We propose LLM-FP4 for quantizing both weights and activations in large
language models (LLMs) down to 4-bit floating-point values, in a post-training
manner. Existing post-training quantization (PTQ) solutions are primarily
integer-based and struggle with bit widths below 8 bits. Compared to integer
quantization, floating-point (FP) quantization is more flexible and can better
handle long-tail or bell-shaped distributions, and it has emerged as a default
choice in many hardware platforms. One characteristic of FP quantization is
that its performance largely depends on the choice of exponent bits and
clipping range. In this regard, we construct a strong FP-PTQ baseline by
searching for the optimal quantization parameters. Furthermore, we observe a
pattern of high inter-channel variance and low intra-channel variance in
activation distributions, which increases the difficulty of activation quantization. We
recognize this pattern to be consistent across a spectrum of transformer models
designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models.
To tackle this, we propose per-channel activation quantization and show that
these additional scaling factors can be reparameterized as exponential biases
of weights, incurring a negligible cost. Our method, for the first time, can
quantize both weights and activations in LLaMA-13B to only 4 bits and
achieves an average score of 63.1 on the common-sense zero-shot reasoning
tasks, which is only 5.8 lower than the full-precision model, significantly
outperforming the previous state-of-the-art by 12.7 points. Code is available
at: https://github.com/nbasyl/LLM-FP4.
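To make the recipe in the abstract concrete, the sketch below simulates FP4 quantization: it enumerates the value grid of a 1-sign/exponent/mantissa format, clips to the grid's range, rounds to the nearest representable value, and grid-searches the exponent-bit count and exponent bias (which sets the clipping range) by reconstruction error. It ends with the scale-folding identity that motivates per-channel activation quantization. This is a minimal, hypothetical illustration, not the released implementation: the names fp_grid, fp_quantize, and search_fp4_format are invented here, and the folding demo shows plain per-channel scale absorption rather than the paper's exact reparameterization into the exponent bias of the weights.

```python
# Hypothetical sketch of simulated FP4 post-training quantization (illustration only).
import torch


def fp_grid(exp_bits: int, man_bits: int, bias: float) -> torch.Tensor:
    """All representable values of a 1-sign / exp_bits / man_bits format."""
    vals = [0.0]
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:   # subnormal numbers: no implicit leading 1
                vals.append((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:        # normal numbers: implicit leading 1
                vals.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    pos = torch.tensor(sorted(set(vals)), dtype=torch.float32)
    return torch.cat([-pos.flip(0)[:-1], pos])   # mirror the grid for the sign bit


def fp_quantize(x: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Clip x to the grid's range, then round every entry to the nearest grid value."""
    x = x.clamp(float(grid.min()), float(grid.max()))     # clipping range of the format
    idx = (x.unsqueeze(-1) - grid).abs().argmin(dim=-1)   # nearest grid point per entry
    return grid[idx]


def search_fp4_format(x: torch.Tensor):
    """Grid-search exponent bits and exponent bias (i.e. clipping range) by MSE."""
    best_err, best_fmt = float("inf"), None
    for exp_bits in (1, 2, 3):                    # 4 bits total: 1 sign + exp_bits + man_bits
        man_bits = 3 - exp_bits
        for bias in [b / 2 for b in range(-8, 16)]:   # candidate biases = clipping ranges
            grid = fp_grid(exp_bits, man_bits, bias)
            err = (fp_quantize(x, grid) - x).pow(2).mean().item()
            if err < best_err:
                best_err, best_fmt = err, (exp_bits, man_bits, bias)
    return best_fmt


x = torch.randn(64, 128)                          # stand-in activation tensor
print("chosen (exp_bits, man_bits, bias):", search_fp4_format(x))

# Per-channel activation scales can be folded into the next linear layer's weights:
# for y = x @ W.T, dividing each input channel of x by s_c and multiplying the
# matching column of W by s_c leaves the product unchanged.
s = x.abs().amax(dim=0).clamp(min=1e-5)           # one scale per input channel
W = torch.randn(32, 128)
y_ref = x @ W.T
y_folded = (x / s) @ (W * s).T
print("max folding error:", (y_ref - y_folded).abs().max().item())
```

The folding identity is why the per-channel scaling factors add essentially no runtime cost: they can be absorbed into the weight tensor offline, so the matmul itself is unchanged.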