
TP-Aware Dequantization

January 15, 2024
Authors: Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti
cs.AI

Abstract

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate up to a 1.81x speedup over existing methods for Llama-70B and up to a 1.78x speedup for IBM WatsonX's Granite-20B MLP layer problem sizes on A100 and H100 NVIDIA DGX systems across a variety of TP settings.
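
The core idea lends itself to a short illustration. The PyTorch sketch below is a simplified assumption, not the authors' optimized kernels: it assumes a GPTQ-style act-order layout in which a hypothetical `g_idx` array maps each input channel of the row-parallel down-projection to its quantization group. Sorting the weight's input channels by group once, offline, makes metadata reads contiguous during dequantization (data locality), and folding the same permutation into the preceding column-parallel layer's output columns means each TP rank's activations already arrive in group order, so no extra all-gather or runtime reorder is needed. All names (`reorder_offline`, `fold_into_producer`, `w_q`, `g_idx`) are illustrative.

```python
import torch

def reorder_offline(w_q: torch.Tensor, g_idx: torch.Tensor):
    """Sort the quantized weight's input channels by group index, once, offline.

    w_q:   (in_features, out_features) weight (stand-in for packed quantized values)
    g_idx: (in_features,) quantization-group index of each input channel

    With channels stored group-contiguously, dequantization can read
    scales/zeros sequentially instead of gathering through g_idx.
    """
    perm = torch.argsort(g_idx)
    return w_q[perm], perm

def fold_into_producer(w_prev: torch.Tensor, perm: torch.Tensor):
    """Fold the permutation into the preceding (column-parallel) layer.

    w_prev: (hidden, intermediate) weight whose outputs feed the reordered
    layer. Permuting its output columns means activations already arrive in
    group order, so each TP rank consumes its local shard directly.
    """
    return w_prev[:, perm]

# Tiny numerical check that folding the permutation preserves the result.
if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(2, 8)               # input activations
    w1 = torch.randn(8, 16)             # up-projection (column-parallel in TP)
    w2 = torch.randn(16, 8)             # down-projection (row-parallel in TP)
    g_idx = torch.randint(0, 4, (16,))  # act-order group assignment for w2

    w2_sorted, perm = reorder_offline(w2, g_idx)
    w1_folded = fold_into_producer(w1, perm)

    ref = (x @ w1) @ w2
    out = (x @ w1_folded) @ w2_sorted
    assert torch.allclose(ref, out, atol=1e-5)
```

Because the permutation is known a priori to every rank, this reordering can be applied once per shard at weight-loading time; the forward pass itself is mathematically unchanged.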