FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Abstract

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) while consistently preserving model quality across varied applications. However, existing systems provide no Tensor Core support for FP6 quantization and struggle to achieve practical performance improvements during LLM inference. Supporting FP6 quantization on GPUs is challenging because of (1) unfriendly memory access to model weights with irregular bit-widths and (2) the high runtime overhead of weight de-quantization. To address these problems, we propose TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for floating-point weights of various quantization bit-widths. We integrate the TC-FPx kernel into an existing inference system, providing new end-to-end support (called FP6-LLM) for quantized LLM inference that achieves better trade-offs between inference cost and model quality. Experiments show that FP6-LLM enables inference of LLaMA-70b on a single GPU, achieving 1.69x-2.65x higher normalized inference throughput than the FP16 baseline. The source code will be publicly available soon.
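The two difficulties named in the abstract can be made concrete with a small CPU-side sketch. It is not the TC-FPx kernel and says nothing about how TC-FPx lays out weights in memory. It only shows why 6-bit values are awkward (four codes share three bytes, so codes straddle byte boundaries) and what de-quantizing a single code involves. The abstract does not specify the bit layout, so the format used here (1 sign bit, 3 exponent bits, 2 mantissa bits, exponent bias 3, no infinity or NaN encodings) is an assumption, and the function names `pack_6bit`, `unpack_6bit`, and `fp6_e3m2_to_float` are hypothetical.

```python
def pack_6bit(codes):
    """Pack 6-bit integer codes (0..63) into bytes, four codes per three bytes."""
    if len(codes) % 4 != 0:
        raise ValueError("number of codes must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        group = codes[i:i + 4]
        if any(not 0 <= v < 64 for v in group):
            raise ValueError("each code must fit in 6 bits")
        a, b, c, d = group
        # Four 6-bit codes form one 24-bit word; codes b and c straddle byte boundaries.
        word = (a << 18) | (b << 12) | (c << 6) | d
        out += word.to_bytes(3, "big")
    return bytes(out)


def unpack_6bit(data):
    """Inverse of pack_6bit: recover the 6-bit codes from the packed bytes."""
    if len(data) % 3 != 0:
        raise ValueError("packed data length must be a multiple of 3")
    codes = []
    for i in range(0, len(data), 3):
        word = int.from_bytes(data[i:i + 3], "big")
        codes.extend((word >> shift) & 0x3F for shift in (18, 12, 6, 0))
    return codes


def fp6_e3m2_to_float(code, bias=3):
    """Decode one 6-bit code under the ASSUMED layout: sign(1) | exponent(3) | mantissa(2)."""
    sign = -1.0 if (code >> 5) & 0x1 else 1.0
    exponent = (code >> 2) & 0x7
    mantissa = code & 0x3
    if exponent == 0:
        # Subnormal range: no implicit leading 1.
        return sign * (mantissa / 4.0) * 2.0 ** (1 - bias)
    return sign * (1.0 + mantissa / 4.0) * 2.0 ** (exponent - bias)


if __name__ == "__main__":
    codes = [0b001100, 0b101100, 0b011111, 0b000001]
    packed = pack_6bit(codes)
    print(len(packed), "bytes hold", len(codes), "weights")
    print([fp6_e3m2_to_float(c) for c in unpack_6bit(packed)])
```

Under these assumptions, four weights occupy three bytes instead of the eight needed in FP16, but no weight sits on a byte or word boundary, and every weight needs several shift and mask operations before it can be used as a wider floating-point value.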
