TP-Aware Dequantization

Abstract

In this paper, we present a novel method that reduces model inference latency during distributed deployment of Large Language Models (LLMs). Our contribution is an optimized inference deployment scheme that addresses the current limitations of state-of-the-art quantization kernels when used in conjunction with Tensor Parallelism (TP). Our method preserves data locality in GPU memory access patterns and exploits a priori knowledge of TP to reduce global communication. Across a variety of TP settings on A100 and H100 NVIDIA DGX systems, we demonstrate up to a 1.81x speedup over existing methods for Llama-70B and up to a 1.78x speedup for the MLP layer problem sizes of IBM WatsonX's Granite-20B.
