QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

October 25, 2023
Authors: Elias Frantar, Dan Alistarh
cs.AI

Abstract

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in the form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) with only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, such as a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The source code and compressed models are available at github.com/IST-DASLab/qmoe.
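
To make the headline figures concrete, the following small Python sketch reproduces the memory arithmetic behind the 3.2TB, 160GB, and 20x numbers. It is a back-of-the-envelope check, not code from the QMoE repository; the 16-bit uncompressed baseline and the per-GPU memory capacities (48GB per A6000, 24GB per RTX 3090) are assumptions used only for illustration.

# Back-of-the-envelope check of the memory figures quoted in the abstract.
# The 16-bit baseline and per-GPU capacities are assumptions, not values
# taken from the QMoE implementation itself.

params = 1.6e12            # SwitchTransformer-c2048 parameter count
bits_uncompressed = 16     # assumed bfloat16 storage (2 bytes per parameter)
bits_compressed = 0.8      # QMoE's reported sub-1-bit rate

uncompressed_tb = params * bits_uncompressed / 8 / 1e12   # bytes -> TB
compressed_gb = params * bits_compressed / 8 / 1e9        # bytes -> GB

print(f"Uncompressed: {uncompressed_tb:.1f} TB")                      # ~3.2 TB
print(f"Compressed:   {compressed_gb:.0f} GB")                        # ~160 GB
print(f"Compression:  {bits_uncompressed / bits_compressed:.0f}x")    # 20x

# Why commodity hardware becomes feasible: the total VRAM of either example
# server configuration exceeds the ~160 GB compressed model size.
print(f"4x A6000: {4 * 48} GB, 8x 3090: {8 * 24} GB")                  # 192 GB each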