
TP-Aware Dequantization

January 15, 2024
Authors: Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti
cs.AI

Abstract

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate up to a 1.81x speedup over existing methods for Llama-70B and up to a 1.78x speedup for IBM WatsonX's Granite-20B MLP layer problem sizes on A100 and H100 NVIDIA DGX systems across a variety of TP settings.
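
The core idea lends itself to a short illustration. The PyTorch sketch below is a simplified assumption, not the authors' optimized kernels: it assumes a GPTQ-style act-order layout in which a hypothetical `g_idx` array maps each input channel of the row-parallel down-projection to its quantization group. Sorting the weight's input channels by group once, offline, makes metadata reads contiguous during dequantization (data locality), and folding the same permutation into the preceding column-parallel layer's output columns means each TP rank's activations already arrive in group order, so no extra all-gather or runtime reorder is needed. All names (`reorder_offline`, `fold_into_producer`, `w_q`, `g_idx`) are illustrative.

```python
import torch

def reorder_offline(w_q: torch.Tensor, g_idx: torch.Tensor):
    """Sort the quantized weight's input channels by group index, once, offline.

    w_q:   (in_features, out_features) weight (stand-in for packed quantized values)
    g_idx: (in_features,) quantization-group index of each input channel

    With channels stored group-contiguously, dequantization can read
    scales/zeros sequentially instead of gathering through g_idx.
    """
    perm = torch.argsort(g_idx)
    return w_q[perm], perm

def fold_into_producer(w_prev: torch.Tensor, perm: torch.Tensor):
    """Fold the permutation into the preceding (column-parallel) layer.

    w_prev: (hidden, intermediate) weight whose outputs feed the reordered
    layer. Permuting its output columns means activations already arrive in
    group order, so each TP rank consumes its local shard directly.
    """
    return w_prev[:, perm]

# Tiny numerical check that folding the permutation preserves the result.
if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(2, 8)               # input activations
    w1 = torch.randn(8, 16)             # up-projection (column-parallel in TP)
    w2 = torch.randn(16, 8)             # down-projection (row-parallel in TP)
    g_idx = torch.randint(0, 4, (16,))  # act-order group assignment for w2

    w2_sorted, perm = reorder_offline(w2, g_idx)
    w1_folded = fold_into_producer(w1, perm)

    ref = (x @ w1) @ w2
    out = (x @ w1_folded) @ w2_sorted
    assert torch.allclose(ref, out, atol=1e-5)
```

Because the permutation is known a priori to every rank, this reordering can be applied once per shard at weight-loading time; the forward pass itself is mathematically unchanged.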