NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

February 6, 2026
Authors: Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi
cs.AI

Abstract

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to compress models efficiently to the binary (1-bit) level, as they either require large amounts of data and compute or incur additional storage overhead. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, compressing full-precision weights into low-rank binary matrices and scales. Specifically, it uses an efficient alternating direction method of multipliers (ADMM) routine to precisely initialize the latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware: for example, it compresses Llama2-70B by 25.8× in just 13 hours on a single H100, enabling a 70B model to run on a consumer GPU with 8 GB of memory.
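
To make the formulation concrete, below is a minimal sketch of low-rank binary factorization: a weight matrix W is approximated as s · B1B2, where B1 and B2 are low-rank {-1, +1} factors and s is a scale. The sketch assumes a single scalar scale and uses plain alternating minimization (a closed-form scale update plus greedy sign flips); the paper's ADMM-based initialization and the block/model reconstruction stages are not reproduced here, and the function names are illustrative.

```python
import numpy as np


def sign_pass(W, A, B, s):
    """One greedy pass over the entries of A in the model W ~= s * A @ B:
    flip a sign whenever doing so lowers the squared reconstruction error."""
    R = W - s * (A @ B)                  # current residual
    b_norms = (B * B).sum(axis=1)        # ||B[k, :]||^2 for each latent row k
    for i in range(A.shape[0]):
        for k in range(A.shape[1]):
            v = A[i, k]
            # Flipping A[i, k] adds 2*s*v*B[k] to residual row i; the change
            # in squared error is 4*s*v*(R[i] . B[k]) + 4*s^2*||B[k]||^2.
            delta = 4 * s * v * (R[i] @ B[k]) + 4 * s * s * b_norms[k]
            if delta < 0:
                A[i, k] = -v
                R[i] = R[i] + 2 * s * v * B[k]
    return A


def binary_lowrank_factorize(W, rank, iters=10, seed=0):
    """Approximate W (m x n) as s * B1 @ B2 with B1 in {-1,+1}^(m x rank) and
    B2 in {-1,+1}^(rank x n); the single scalar scale s is a simplification
    made for this sketch."""
    rng = np.random.default_rng(seed)
    m, n = W.shape
    B1 = rng.choice([-1.0, 1.0], size=(m, rank))
    B2 = rng.choice([-1.0, 1.0], size=(rank, n))
    s = 1.0
    for _ in range(iters):
        P = B1 @ B2
        s = float((W * P).sum() / (P * P).sum())   # closed-form optimal scale
        B1 = sign_pass(W, B1, B2, s)               # update left binary factor
        B2 = sign_pass(W.T, B2.T, B1.T, s).T       # same update, transposed
    return B1, B2, s


if __name__ == "__main__":
    W = np.random.randn(128, 128)
    B1, B2, s = binary_lowrank_factorize(W, rank=32)
    rel_err = np.linalg.norm(W - s * B1 @ B2) / np.linalg.norm(W)
    # rank * (m + n) binary entries stored for m * n weights.
    bits_per_weight = 32 * (128 + 128) / (128 * 128)
    print(f"relative error {rel_err:.3f}, {bits_per_weight:.2f} bits/weight")
```

The storage cost of such a factorization is roughly rank × (m + n) bits for the binary factors plus the scales, so for a square 4096 × 4096 layer a rank of 1024 already corresponds to 0.5 bits per weight; this is how sub-1-bit compression rates arise.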