TP-Aware Dequantization

January 15, 2024
Authors: Adnan Hoque, Mudhakar Srivatsa, Chih-Chieh Yang, Raghu Ganti
cs.AI

Abstract

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallel (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. We demonstrate an up to 1.81x speedup over existing methods for Llama-70B and an up to 1.78x speedup for IBM WatsonX's Granite-20B MLP layer problem sizes on A100 and H100 NVIDIA DGX systems for a variety of TP settings.
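
The abstract does not spell out the mechanism, but the idea of exploiting a priori knowledge of TP to preserve locality can be illustrated with a small sketch. The following is a minimal, hypothetical example in PyTorch, not the authors' kernel: assuming a GPTQ-style quantized weight whose input rows carry a group index (here called `g_idx`) used for quantization-scale lookups, each tensor-parallel shard can be re-sorted by group index once at load time, so that dequantization on every rank reads scales contiguously and no runtime gather or extra global communication is needed. All names (`shard_and_reorder`, `qweight`, `g_idx`) and the simplified unpacked layout are illustrative assumptions.

```python
# Minimal, illustrative sketch (not the paper's kernel): pre-sort each
# tensor-parallel shard of a GPTQ-style weight by its quantization-group
# index so per-rank dequantization reads scales with good data locality.
import torch

def shard_and_reorder(qweight: torch.Tensor, g_idx: torch.Tensor, tp_size: int):
    """Split weight rows (the TP-sharded input dimension) across tp_size
    ranks, then sort each shard's rows by group index. This is done once at
    load time; the returned permutation can be folded into the adjacent
    layer offline so no reordering is needed during inference."""
    rows = qweight.shape[0]
    rows_per_rank = rows // tp_size  # assume rows divisible by tp_size
    shards = []
    for rank in range(tp_size):
        sl = slice(rank * rows_per_rank, (rank + 1) * rows_per_rank)
        w, g = qweight[sl], g_idx[sl]
        perm = torch.argsort(g)               # purely local reorder
        shards.append((w[perm].contiguous(), g[perm], perm))
    return shards

# Example: 8 input rows in 4 quantization groups, sharded over 2 ranks.
qweight = torch.randint(0, 16, (8, 4), dtype=torch.int32)
g_idx = torch.tensor([3, 0, 2, 1, 1, 3, 0, 2])
for rank, (w, g, perm) in enumerate(shard_and_reorder(qweight, g_idx, 2)):
    print(rank, g.tolist())  # group indices are now sorted within each rank
```

Because the reorder is computed per shard before deployment, each rank's memory access pattern stays contiguous regardless of the original activation-order permutation, which is the kind of locality-preserving, communication-avoiding behavior the abstract describes.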