FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
January 25, 2024
Authors: Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song
cs.AI
Abstract
Six-bit quantization (FP6) can effectively reduce the size of large language
models (LLMs) while consistently preserving model quality across varied
applications. However, existing systems do not provide Tensor Core support for
FP6 quantization and struggle to achieve practical performance improvements
during LLM inference. Supporting FP6 quantization on GPUs is challenging due to
(1) the unfriendly memory access patterns of model weights with irregular
bit-widths and (2) the high runtime overhead of weight de-quantization. To
address these problems, we propose TC-FPx, the first full-stack GPU kernel
design scheme with unified Tensor Core support for floating-point weights of
various quantization bit-widths. We integrate the TC-FPx kernel into an
existing inference system, providing new
end-to-end support (called FP6-LLM) for quantized LLM inference, where better
trade-offs between inference cost and model quality are achieved. Experiments
show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU,
achieving 1.69x-2.65x higher normalized inference throughput than the FP16
baseline. The source code will be publicly available soon.
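
For readers unfamiliar with 6-bit floating-point weights, the following is a minimal host-side sketch of de-quantizing a single FP6 value to a 32-bit float. It assumes a 1-sign/3-exponent/2-mantissa (E3M2) layout with exponent bias 3; the exact bit layout, bias handling, and the paper's actual GPU-side de-quantization path are not given in this abstract, so treat this purely as an illustration of why per-weight de-quantization at irregular bit-widths adds runtime overhead.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode one 6-bit value packed in the low bits of `bits`.
// Assumed layout (illustrative only): 1 sign, 3 exponent, 2 mantissa bits.
float fp6_e3m2_to_float(uint8_t bits) {
    const int sign     = (bits >> 5) & 0x1;  // 1 sign bit
    const int exponent = (bits >> 2) & 0x7;  // 3 exponent bits
    const int mantissa = bits & 0x3;         // 2 mantissa bits
    const int bias     = 3;                  // assumed bias: 2^(3-1) - 1

    float value;
    if (exponent == 0) {
        // Subnormal range: no implicit leading 1.
        value = std::ldexp(mantissa / 4.0f, 1 - bias);
    } else {
        // Normal range: implicit leading 1.
        value = std::ldexp(1.0f + mantissa / 4.0f, exponent - bias);
    }
    return sign ? -value : value;
}

int main() {
    // 0b011011: sign = 0, exponent = 0b110 (6), mantissa = 0b11 (3)
    // -> (1 + 3/4) * 2^(6 - 3) = 14.0
    std::printf("%f\n", fp6_e3m2_to_float(0b011011));
    return 0;
}
```

In an inference kernel, a decode of this kind must run for every weight before it can feed FP16 Tensor Core instructions, and 6-bit values do not align to byte or word boundaries in memory; these are the de-quantization overhead and irregular-access challenges the abstract refers to.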