LLM-FP4: 4-Bit Floating-Point Quantized Transformers

Abstract

We propose LLM-FP4, a post-training method for quantizing both the weights and the activations of large language models (LLMs) to 4-bit floating-point values. Existing post-training quantization (PTQ) solutions are primarily integer-based and struggle at bit widths below 8 bits. Compared to integer quantization, floating-point (FP) quantization is more flexible and handles long-tail or bell-shaped distributions better, and it has emerged as a default choice on many hardware platforms. One characteristic of FP quantization is that its performance depends heavily on the choice of exponent bits and clipping range. We therefore construct a strong FP-PTQ baseline by searching for the optimal quantization parameters. Furthermore, we observe a pattern of high inter-channel variance and low intra-channel variance in activation distributions, which makes activations harder to quantize. This pattern is consistent across a spectrum of transformer models designed for diverse tasks, such as LLMs, BERT, and Vision Transformer models. To tackle it, we propose per-channel activation quantization and show that the additional scaling factors can be reparameterized as exponent biases of the weights, incurring negligible cost. Our method is the first to quantize both weights and activations of LLaMA-13B to only 4 bits. It achieves an average score of 63.1 on the common sense zero-shot reasoning tasks, which is only 5.8 lower than the full-precision model and 12.7 points higher than the previous state of the art. Code is available at: https://github.com/nbasyl/LLM-FP4.
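
The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository above for that). It assumes a signed E2M1 format (2 exponent bits, 1 mantissa bit) with a clipping range of 6.0, a real-valued exponent bias derived from that clipping range, and hypothetical power-of-two per-channel scales. The names `fp_quantize`, `matvec`, `scales`, and `W_folded` are invented for the illustration. The first function simulates rounding onto the FP grid. The rest checks that rescaling activations per channel at run time gives the same product as folding the scales into the weights ahead of time.

```python
import math

def fp_quantize(x, exp_bits=2, man_bits=1, max_val=6.0):
    """Round a scalar to the nearest point of a simulated signed FP grid."""
    # Real-valued exponent bias chosen so the largest magnitude on the grid is max_val.
    bias = 2**exp_bits - 1 - math.log2(max_val / (2 - 2.0**-man_bits))
    x = max(-max_val, min(max_val, x))  # clip to the representable range
    if x == 0.0:
        return 0.0
    # Biased exponent of x; values below 1 fall into the subnormal range.
    p = max(math.floor(math.log2(abs(x)) + bias), 1)
    step = 2.0 ** (p - bias - man_bits)  # spacing of grid points around x
    return step * round(x / step)

def matvec(row, W):
    """Compute row @ W, where W[j][k] maps input channel j to output k."""
    return [sum(a * w for a, w in zip(row, col)) for col in zip(*W)]

# One token with three input channels; channel 1 is an outlier channel.
x = [0.3, 11.0, -0.7]
scales = [0.125, 2.0, 0.25]  # hypothetical per-channel scales (powers of two)
W = [[1.0, -0.5], [0.5, 1.5], [-2.0, 3.0]]  # weights already on the E2M1 grid

# Per-channel activation quantization: each channel is scaled onto the shared grid.
codes = [fp_quantize(v / s) for v, s in zip(x, scales)]

# Option A: rescale the activations at run time, then multiply.
direct = matvec([c * s for c, s in zip(codes, scales)], W)

# Option B: fold the scales into the weight rows once, offline.
# For a power-of-two scale this only shifts the exponent of each FP weight.
W_folded = [[s * w for w in row] for row, s in zip(W, scales)]
folded = matvec(codes, W_folded)

print(codes)           # activations on the shared FP4 grid
print(direct, folded)  # the two products agree
```

With the default arguments the grid of non-negative values is 0, 0.5, 1, 1.5, 2, 3, 4, 6. Because the scales in this sketch are powers of two, folding them into the weights changes only the weight exponents, which is the intuition behind treating the extra scaling factors as exponent biases.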
