NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Abstract

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to compress models efficiently to the binary (1-bit) level, because they either require large amounts of data and compute or incur additional storage overhead. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem and compresses full-precision weights into low-rank binary matrices and scales. Specifically, it uses an efficient alternating direction method of multipliers (ADMM) to precisely initialize the latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8× in just 13 hours on a single H100, enabling a 70B model to run on a consumer GPU with 8 GB of memory.
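
As a rough illustration of why a low-rank binary factorization can fall below one bit per weight, and of what a 25.8× compression ratio means in memory terms, the following sketch does the storage arithmetic. It is not the paper's implementation. The function names, the 16-bit baseline and scale precision, the per-row and per-column scale layout, the 8192 × 8192 layer shape, the rank of 2048, and the round figure of 70e9 parameters are all assumptions made for illustration.

```python
def bits_per_weight(rows: int, cols: int, rank: int, scale_bits: int = 16) -> float:
    """Average stored bits per original weight for a rank-`rank` binary factorization.

    Assumed layout (hypothetical, for illustration only):
      - binary factor U of shape (rows, rank), 1 bit per entry
      - binary factor V of shape (cols, rank), 1 bit per entry
      - one scale per row and one per column, `scale_bits` bits each
    """
    binary_bits = rank * (rows + cols)
    scale_storage_bits = scale_bits * (rows + cols)
    return (binary_bits + scale_storage_bits) / (rows * cols)


def effective_bits(compression_ratio: float, baseline_bits: int = 16) -> float:
    """Average bits per weight implied by a compression ratio over a baseline format."""
    return baseline_bits / compression_ratio


def compressed_size_gb(num_params: float, compression_ratio: float,
                       baseline_bits: int = 16) -> float:
    """Compressed size in GB (1 GB = 1e9 bytes), given a baseline precision."""
    baseline_bytes = num_params * baseline_bits / 8
    return baseline_bytes / compression_ratio / 1e9


if __name__ == "__main__":
    # A square layer with rank equal to a quarter of its width (assumed values).
    print(f"{bits_per_weight(8192, 8192, 2048):.3f} bits/weight")
    # The 25.8x ratio quoted in the abstract, assuming a 16-bit baseline.
    print(f"{effective_bits(25.8):.3f} bits/weight")
    print(f"{compressed_size_gb(70e9, 25.8):.2f} GB")
```

Under these assumptions, the binary factors cost rank × (rows + cols) bits, so the average drops below one bit per weight whenever the rank is smaller than rows × cols / (rows + cols). With a 16-bit baseline, a 25.8× ratio corresponds to roughly 0.62 bits per weight and about 5.4 GB for 70e9 parameters, which is consistent with the abstract's claim of fitting within 8 GB.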
