BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

February 6, 2024
Authors: Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi
cs.AI

Abstract

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but come with significant demands on memory and computational resources. As a powerful compression technique, binarization can reduce model weights to as little as 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance under ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. For the first time, BiLLM achieves high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming SOTA quantization methods for LLMs by significant margins. Moreover, BiLLM can binarize an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency.
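The abstract names two key components: a binary residual approximation for the structurally selected salient weights, and an optimal splitting search for the bell-shaped non-salient weights. The NumPy sketch below illustrates both ideas under simplifying assumptions (per-group scalar scales, a plain L2 reconstruction objective, and a brute-force scan over candidate break points); it is an illustrative reconstruction, not the authors' implementation.

```python
# Minimal sketch of the two binarization ideas described in the abstract.
# Assumptions (not from the paper's code): scalar per-group scales, L2 error,
# and a simple linear scan over candidate break points.
import numpy as np

def binarize(w):
    """Rank-1 binarization: w ~= alpha * sign(w), with alpha = mean(|w|),
    the L2-optimal scalar scale for a sign vector."""
    alpha = np.mean(np.abs(w))
    return alpha * np.sign(w), alpha

def residual_binarize(w):
    """Binary residual approximation for salient weights: binarize w, then
    binarize the remaining residual, giving two binary terms in total."""
    b1, a1 = binarize(w)
    b2, a2 = binarize(w - b1)
    return b1 + b2, (a1, a2)

def split_binarize(w, num_candidates=64):
    """Splitting search for bell-shaped non-salient weights: try break points p,
    binarize the concentrated (|w| <= p) and sparse (|w| > p) regions with
    separate scales, and keep the p with the lowest reconstruction error."""
    best_err, best_rec = np.inf, None
    for p in np.linspace(1e-6, np.max(np.abs(w)), num_candidates):
        inner, outer = np.abs(w) <= p, np.abs(w) > p
        rec = np.zeros_like(w)
        if inner.any():
            rec[inner], _ = binarize(w[inner])
        if outer.any():
            rec[outer], _ = binarize(w[outer])
        err = np.sum((w - rec) ** 2)
        if err < best_err:
            best_err, best_rec = err, rec
    return best_rec, best_err

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096)           # bell-shaped non-salient weights (toy data)
    salient = rng.standard_normal(64) * 5.0  # a small, higher-magnitude salient group (toy data)
    rec_s, _ = residual_binarize(salient)
    rec_n, _ = split_binarize(w)
    print("salient rel. error:", np.linalg.norm(salient - rec_s) / np.linalg.norm(salient))
    print("non-salient rel. error:", np.linalg.norm(w - rec_n) / np.linalg.norm(w))
```

Spending two binary terms only on the small salient group while the bulk of the weights stay at one binary term per region is what keeps the average bit-width near 1 (the paper reports 1.08 bits).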