

QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

October 25, 2023
Authors: Elias Frantar, Dan Alistarh
cs.AI

Abstract

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in the form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The source code and compressed models are available at github.com/IST-DASLab/qmoe.
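As a quick sanity check of the memory figures quoted in the abstract, the sketch below reproduces the arithmetic. The 16-bit uncompressed baseline is an assumption on our part, but it is the precision consistent with 3.2 TB for 1.6 trillion parameters; the 0.8 bits per parameter and 20x ratio are the paper's reported numbers.

```python
# Back-of-the-envelope check of the compression figures in the abstract.
# Assumes a 16-bit (e.g., bfloat16) uncompressed baseline, which matches
# the quoted 3.2 TB for 1.6 trillion parameters.

params = 1.6e12            # SwitchTransformer-c2048 parameter count
baseline_bits = 16         # assumed uncompressed precision per parameter
compressed_bits = 0.8      # QMoE's reported average bits per parameter

baseline_tb = params * baseline_bits / 8 / 1e12      # bytes -> TB
compressed_gb = params * compressed_bits / 8 / 1e9   # bytes -> GB

print(f"Uncompressed: {baseline_tb:.1f} TB")                    # ~3.2 TB
print(f"Compressed:   {compressed_gb:.0f} GB")                  # ~160 GB
print(f"Ratio:        {baseline_bits / compressed_bits:.0f}x")  # 20x
```

The quoted commodity setups are consistent with this: 4x NVIDIA A6000 (48 GB each) or 8x NVIDIA 3090 (24 GB each) both provide 192 GB of aggregate GPU memory, which accommodates the roughly 160 GB compressed model plus activations and decoding buffers.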