FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design
January 25, 2024
Authors: Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song
cs.AI
Abstract
Six-bit quantization (FP6) can effectively reduce the size of large language
models (LLMs) while consistently preserving model quality across varied
applications. However, existing systems do not provide Tensor Core support for
FP6 quantization and struggle to achieve practical performance improvements
during LLM inference. Supporting FP6 quantization on GPUs is challenging due to
(1) the unfriendly memory access patterns of model weights with irregular
bit-widths and (2) the high runtime overhead of weight de-quantization. To
address these problems, we propose TC-FPx, the first full-stack GPU kernel
design scheme with unified Tensor Core support for floating-point weights of
various quantization bit-widths. We integrate the TC-FPx kernel into an
existing inference system, providing new
end-to-end support (called FP6-LLM) for quantized LLM inference, where better
trade-offs between inference cost and model quality are achieved. Experiments
show that FP6-LLM enables the inference of LLaMA-70b using only a single GPU,
achieving 1.69x-2.65x higher normalized inference throughput than the FP16
baseline. The source code will be publicly available soon.
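
For readers unfamiliar with 6-bit floating-point weights, the following is a minimal host-side sketch of de-quantizing a single FP6 value to a 32-bit float. It assumes a 1-sign/3-exponent/2-mantissa (E3M2) layout with exponent bias 3; the exact bit layout, bias handling, and the paper's actual GPU-side de-quantization path are not given in this abstract, so treat this purely as an illustration of why per-weight de-quantization at irregular bit-widths adds runtime overhead.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Decode one 6-bit value packed in the low bits of `bits`.
// Assumed layout (illustrative only): 1 sign, 3 exponent, 2 mantissa bits.
float fp6_e3m2_to_float(uint8_t bits) {
    const int sign     = (bits >> 5) & 0x1;  // 1 sign bit
    const int exponent = (bits >> 2) & 0x7;  // 3 exponent bits
    const int mantissa = bits & 0x3;         // 2 mantissa bits
    const int bias     = 3;                  // assumed bias: 2^(3-1) - 1

    float value;
    if (exponent == 0) {
        // Subnormal range: no implicit leading 1.
        value = std::ldexp(mantissa / 4.0f, 1 - bias);
    } else {
        // Normal range: implicit leading 1.
        value = std::ldexp(1.0f + mantissa / 4.0f, exponent - bias);
    }
    return sign ? -value : value;
}

int main() {
    // 0b011011: sign = 0, exponent = 0b110 (6), mantissa = 0b11 (3)
    // -> (1 + 3/4) * 2^(6 - 3) = 14.0
    std::printf("%f\n", fp6_e3m2_to_float(0b011011));
    return 0;
}
```

In an inference kernel, a decode of this kind must run for every weight before it can feed FP16 Tensor Core instructions, and 6-bit values do not align to byte or word boundaries in memory; these are the de-quantization overhead and irregular-access challenges the abstract refers to.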