NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Abstract

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem and compresses full-precision weights into low-rank binary matrices and scales. Specifically, it uses an efficient alternating direction method of multipliers (ADMM) to precisely initialize latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8× in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.
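To make the sub-1-bit claim concrete, the following is a minimal sketch of the storage arithmetic behind a low-rank binary factorization. It assumes a layout of the form W ≈ diag(s_out) (U Vᵀ) diag(s_in), with binary factors U and V and 16-bit per-row and per-column scales, and a 16-bit baseline for the uncompressed model. These choices, the function names, and the 8192 × 8192 layer are illustrative assumptions, not details taken from the paper.

```python
def bits_per_weight(rows: int, cols: int, rank: int, scale_bits: int = 16) -> float:
    """Average storage per original weight for W ~ diag(s_out) (U @ V.T) diag(s_in).

    Assumed layout (hypothetical, for illustration only):
    U is rows x rank and V is cols x rank with entries in {-1, +1} (1 bit each);
    s_out and s_in are per-row and per-column scales of scale_bits each.
    """
    binary_bits = rank * (rows + cols)          # 1 bit per entry of U and V
    scale_storage = scale_bits * (rows + cols)  # one scale per row and per column
    return (binary_bits + scale_storage) / (rows * cols)


def compressed_size_gb(num_params: float, compression_ratio: float,
                       baseline_bits: int = 16) -> float:
    """Model size in GB (10^9 bytes) after dividing the baseline size by the ratio.

    The 16-bit baseline is an assumption; the abstract only states the ratio.
    """
    baseline_bytes = num_params * baseline_bits / 8
    return baseline_bytes / compression_ratio / 1e9


# Hypothetical 8192 x 8192 layer: the rank controls the effective bit width.
print(round(bits_per_weight(8192, 8192, 4096), 3))  # 1.004 (about 1 bit per weight)
print(round(bits_per_weight(8192, 8192, 2048), 3))  # 0.504 (sub-1-bit)

# 70B parameters at an assumed 16-bit baseline, compressed 25.8x.
print(round(compressed_size_gb(70e9, 25.8), 2))     # 5.43
```

Under these assumptions, the binary factors cost rank × (rows + cols) bits, so the effective width drops below 1 bit per weight once the rank is smaller than rows × cols / (rows + cols). The last line shows why a 25.8× reduction of a 70B model is consistent with fitting the weights on an 8 GB GPU.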
