"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

Abstract

Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Additionally, our study examines how text generated by quantized models differs from text generated by their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements that allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs a surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
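To make the format names concrete: in the abstract, W8A8 means 8-bit weights and 8-bit activations, while W4A16 means 4-bit weights with 16-bit activations. The following is a minimal sketch of integer weight quantization, assuming plain symmetric round-to-nearest with one scale per weight row. That scheme, the helper names, and the sample values are illustrative assumptions, not the tuned quantization procedure evaluated in the study.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization of one weight row (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Map each weight to the nearest integer level and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to floating-point values."""
    return [v * scale for v in q]


def max_abs_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))


# Made-up weight row, used only to compare the two bit widths.
row = [0.62, -1.27, 0.05, 0.9, -0.33]
err8 = max_abs_error(row, 8)   # INT8 weights, as in W8A8-INT
err4 = max_abs_error(row, 4)   # INT4 weights, as in W4A16-INT
print(f"INT8 max error: {err8:.4f}, INT4 max error: {err4:.4f}")
```

With round-to-nearest, the per-weight error is bounded by half the scale, so the coarser 4-bit grid gives a larger worst-case error on this row. A naive scheme like this one only illustrates what the bit widths mean; it is not a stand-in for the properly tuned methods whose accuracy the abstract reports.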
