LLM-FP4: 4-Bit Floating-Point Quantized Transformers
October 25, 2023
Authors: Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
cs.AI
Abstract
We propose LLM-FP4 for quantizing both weights and activations in large
language models (LLMs) down to 4-bit floating-point values, in a post-training
manner. Existing post-training quantization (PTQ) solutions are primarily
integer-based and struggle with bit widths below 8 bits. Compared to integer
quantization, floating-point (FP) quantization is more flexible and can better
handle long-tail or bell-shaped distributions, and it has emerged as a default
choice in many hardware platforms. One characteristic of FP quantization is
that its performance largely depends on the choice of exponent bits and
clipping range. In this regard, we construct a strong FP-PTQ baseline by
searching for the optimal quantization parameters. Furthermore, we observe a
pattern of high inter-channel variance and low intra-channel variance in
activation distributions, which increases the difficulty of activation quantization. We
recognize this pattern to be consistent across a spectrum of transformer models
designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models.
To tackle this, we propose per-channel activation quantization and show that
these additional scaling factors can be reparameterized as exponential biases
of weights, incurring a negligible cost. Our method, for the first time, can
quantize both weights and activations in LLaMA-13B to only 4 bits and
achieves an average score of 63.1 on the common-sense zero-shot reasoning
tasks, which is only 5.8 lower than the full-precision model, significantly
outperforming the previous state-of-the-art by 12.7 points. Code is available
at: https://github.com/nbasyl/LLM-FP4.
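To make the recipe in the abstract concrete, the sketch below simulates FP4 quantization: it enumerates the value grid of a 1-sign/exponent/mantissa format, clips to the grid's range, rounds to the nearest representable value, and grid-searches the exponent-bit count and exponent bias (which sets the clipping range) by reconstruction error. It ends with the scale-folding identity that motivates per-channel activation quantization. This is a minimal, hypothetical illustration, not the released implementation: the names fp_grid, fp_quantize, and search_fp4_format are invented here, and the folding demo shows plain per-channel scale absorption rather than the paper's exact reparameterization into the exponent bias of the weights.

```python
# Hypothetical sketch of simulated FP4 post-training quantization (illustration only).
import torch


def fp_grid(exp_bits: int, man_bits: int, bias: float) -> torch.Tensor:
    """All representable values of a 1-sign / exp_bits / man_bits format."""
    vals = [0.0]
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            if e == 0:   # subnormal numbers: no implicit leading 1
                vals.append((m / 2 ** man_bits) * 2.0 ** (1 - bias))
            else:        # normal numbers: implicit leading 1
                vals.append((1 + m / 2 ** man_bits) * 2.0 ** (e - bias))
    pos = torch.tensor(sorted(set(vals)), dtype=torch.float32)
    return torch.cat([-pos.flip(0)[:-1], pos])   # mirror the grid for the sign bit


def fp_quantize(x: torch.Tensor, grid: torch.Tensor) -> torch.Tensor:
    """Clip x to the grid's range, then round every entry to the nearest grid value."""
    x = x.clamp(float(grid.min()), float(grid.max()))     # clipping range of the format
    idx = (x.unsqueeze(-1) - grid).abs().argmin(dim=-1)   # nearest grid point per entry
    return grid[idx]


def search_fp4_format(x: torch.Tensor):
    """Grid-search exponent bits and exponent bias (i.e. clipping range) by MSE."""
    best_err, best_fmt = float("inf"), None
    for exp_bits in (1, 2, 3):                    # 4 bits total: 1 sign + exp_bits + man_bits
        man_bits = 3 - exp_bits
        for bias in [b / 2 for b in range(-8, 16)]:   # candidate biases = clipping ranges
            grid = fp_grid(exp_bits, man_bits, bias)
            err = (fp_quantize(x, grid) - x).pow(2).mean().item()
            if err < best_err:
                best_err, best_fmt = err, (exp_bits, man_bits, bias)
    return best_fmt


x = torch.randn(64, 128)                          # stand-in activation tensor
print("chosen (exp_bits, man_bits, bias):", search_fp4_format(x))

# Per-channel activation scales can be folded into the next linear layer's weights:
# for y = x @ W.T, dividing each input channel of x by s_c and multiplying the
# matching column of W by s_c leaves the product unchanged.
s = x.abs().amax(dim=0).clamp(min=1e-5)           # one scale per input channel
W = torch.randn(32, 128)
y_ref = x @ W.T
y_folded = (x / s) @ (W * s).T
print("max folding error:", (y_ref - y_folded).abs().max().item())
```

The folding identity is why the per-channel scaling factors add essentially no runtime cost: they can be absorbed into the weight tensor offline, so the matmul itself is unchanged.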