
FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

January 25, 2024
作者: Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song
cs.AI

Abstract

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) while consistently preserving model quality across varied applications. However, existing systems do not provide Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. Supporting FP6 quantization on GPUs is challenging due to (1) the unfriendly memory access pattern of model weights with an irregular bit-width and (2) the high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for floating-point weights of various quantization bit-widths. We integrate the TC-FPx kernels into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference, achieving a better trade-off between inference cost and model quality. Experiments show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. The source code will be publicly available soon.
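The two GPU-side challenges named in the abstract are easiest to see off-GPU. Below is a minimal NumPy sketch, not the paper's TC-FPx implementation, assuming the FP6 E3M2 layout (1 sign, 3 exponent, 2 mantissa bits); the helper names `fp6_table`, `quantize_fp6`, `dequantize_fp6`, and `pack_fp6` are hypothetical. It shows round-to-nearest FP6 quantization and why the 6-bit width makes memory access irregular: four codes pack into three bytes, so individual weights straddle byte and word boundaries.

```python
# Illustrative FP6 (E3M2) quantization sketch in NumPy.
# Hypothetical helpers; FP6-LLM performs unpacking and de-quantization
# inside fused GPU kernels feeding Tensor Cores, not on the CPU like this.
import numpy as np

EXP_BITS, MAN_BITS = 3, 2
BIAS = (1 << (EXP_BITS - 1)) - 1  # exponent bias = 3

def fp6_table():
    """Enumerate all 64 representable FP6 (E3M2) values."""
    vals = np.empty(64, dtype=np.float32)
    for code in range(64):
        s = code >> 5            # sign bit
        e = (code >> 2) & 0x7    # 3 exponent bits
        m = code & 0x3           # 2 mantissa bits
        if e == 0:               # subnormal: no implicit leading 1
            mag = (m / 4.0) * 2.0 ** (1 - BIAS)
        else:                    # normal: implicit leading 1
            mag = (1.0 + m / 4.0) * 2.0 ** (e - BIAS)
        vals[code] = -mag if s else mag
    return vals

TABLE = fp6_table()

def quantize_fp6(w):
    """Round each weight to the nearest representable FP6 value;
    returns 6-bit codes (stored one-per-uint8 for clarity)."""
    diffs = np.abs(w.reshape(-1, 1) - TABLE.reshape(1, -1))
    return diffs.argmin(axis=1).astype(np.uint8)

def dequantize_fp6(codes):
    return TABLE[codes]

def pack_fp6(codes):
    """The 'irregular bit-width' problem: four 6-bit codes pack into
    three bytes, so no weight starts at an 8/16/32-bit boundary."""
    assert codes.size % 4 == 0
    c = codes.reshape(-1, 4).astype(np.uint32)
    word = (c[:, 0] << 18) | (c[:, 1] << 12) | (c[:, 2] << 6) | c[:, 3]
    out = np.empty((c.shape[0], 3), dtype=np.uint8)
    out[:, 0] = (word >> 16) & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = word & 0xFF
    return out.ravel()

w = np.random.randn(8).astype(np.float32)
codes = quantize_fp6(w)
print("original:", w)
print("fp6     :", dequantize_fp6(codes))
print("packed  :", pack_fp6(codes))  # 8 weights -> 6 bytes
```

Per the abstract, FP6-LLM moves the unpacking and de-quantization steps sketched above into the GPU kernel itself, ahead of the Tensor Core matrix multiply; minimizing that in-kernel overhead is precisely what the TC-FPx design targets.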