QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

Abstract

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters and requires 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem in the form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm that accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) with only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, such as a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The source code and compressed models are available at github.com/IST-DASLab/qmoe.
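The memory figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and rests on assumptions not stated in the abstract: a 16-bit uncompressed weight format (implied by 3.2TB for 1.6 trillion parameters), decimal units (1 GB = 10^9 bytes), and per-GPU memory of 48 GB for the NVIDIA A6000 and 24 GB for the NVIDIA 3090. The helper name `size_gb` is hypothetical.

```python
# Back-of-the-envelope check of the abstract's memory figures.
# Assumptions (not from the abstract): 16-bit uncompressed weights,
# decimal gigabytes, 48 GB per A6000 and 24 GB per RTX 3090.

PARAMS = 1.6e12  # parameter count of SwitchTransformer-c2048, per the abstract


def size_gb(num_params: float, bits_per_param: float) -> float:
    """Return the weight storage size in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


uncompressed_gb = size_gb(PARAMS, 16)    # about 3200 GB, i.e. 3.2 TB
compressed_gb = size_gb(PARAMS, 0.8)     # about 160 GB at the rounded 0.8 bits/param
ratio = uncompressed_gb / compressed_gb  # about 20x

# Aggregate GPU memory of the two server configurations named in the abstract.
servers_gb = {
    "4x A6000": 4 * 48,
    "8x RTX 3090": 8 * 24,
}

# Whether the compressed weights alone fit in each configuration.
fits = {name: compressed_gb < total for name, total in servers_gb.items()}

if __name__ == "__main__":
    print(f"uncompressed: {uncompressed_gb:.0f} GB")
    print(f"compressed:   {compressed_gb:.0f} GB ({ratio:.0f}x smaller)")
    for name, total in servers_gb.items():
        print(f"{name}: {total} GB total, weights fit: {fits[name]}")
```

Because 0.8 bits per parameter is a rounded figure, the computed 160 GB is an approximate upper bound, consistent with the abstract's "less than 160GB". Both server configurations provide 192 GB in total under the stated assumptions, which leaves some room beyond the weights themselves; how much of that room inference actually needs is not something this sketch estimates.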
