BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Abstract

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but place significant demands on memory and computational resources. As a powerful compression technique, binarization reduces model weights to a mere 1 bit, lowering these expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance at ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming state-of-the-art LLM quantization methods by significant margins. Moreover, BiLLM binarizes an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency.
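The two ideas named in the abstract, binary residual approximation and a splitting search over bell-shaped weights, can be illustrated with a small sketch. This is not the BiLLM implementation: the abstract gives no formulas, so the code assumes the standard sign-and-scale binarizer (scale equal to the mean absolute value), a single residual pass, and a grid search over one magnitude break-point. All function names and the `num_candidates` parameter are hypothetical, and the sketch works on a flat list of weights rather than on real model layers.

```python
import random


def binarize(weights):
    """Approximate weights by alpha * sign(w); alpha = mean(|w|) minimizes squared error."""
    if not weights:
        return []
    alpha = sum(abs(w) for w in weights) / len(weights)
    return [alpha if w >= 0 else -alpha for w in weights]


def squared_error(weights, approx):
    """Sum of squared differences between the weights and their approximation."""
    return sum((w - a) ** 2 for w, a in zip(weights, approx))


def residual_binarize(weights):
    """Assumed residual scheme: binarize, then binarize what is left over and add it back."""
    first = binarize(weights)
    residual = [w - a for w, a in zip(weights, first)]
    second = binarize(residual)
    return [a + b for a, b in zip(first, second)]


def split_binarize(weights, num_candidates=32):
    """Assumed split search: try break-points p on a grid and binarize the
    groups |w| <= p and |w| > p separately, keeping the lowest-error split."""
    max_abs = max(abs(w) for w in weights)
    best_p = None  # None means "no split beat plain binarization"
    best_approx = binarize(weights)
    best_err = squared_error(weights, best_approx)
    for k in range(1, num_candidates):
        p = max_abs * k / num_candidates
        low_idx = [i for i, w in enumerate(weights) if abs(w) <= p]
        high_idx = [i for i, w in enumerate(weights) if abs(w) > p]
        approx = [0.0] * len(weights)
        for idx in (low_idx, high_idx):
            group = binarize([weights[i] for i in idx])
            for i, a in zip(idx, group):
                approx[i] = a
        err = squared_error(weights, approx)
        if err < best_err:
            best_p, best_err, best_approx = p, err, approx
    return best_p, best_approx


if __name__ == "__main__":
    random.seed(0)
    # Synthetic bell-shaped weights stand in for a real layer.
    weights = [random.gauss(0.0, 1.0) for _ in range(4096)]
    print("plain    :", squared_error(weights, binarize(weights)))
    print("residual :", squared_error(weights, residual_binarize(weights)))
    break_point, approx = split_binarize(weights)
    print("split    :", squared_error(weights, approx), "at p =", break_point)
```

On synthetic Gaussian weights, both the residual pass and the split reduce the squared reconstruction error relative to plain binarization. Each also costs extra storage (a second sign pattern, or group membership plus a second scale), which is presumably one reason the average bit-width reported in the abstract sits slightly above 1 bit.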
