BiLLM: Pushing the Limit of Post-Training Quantization for LLMs
February 6, 2024
Authors: Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi
cs.AI
Abstract
Pretrained large language models (LLMs) exhibit exceptional general language
processing capabilities but come with significant demands on memory and
computational resources. As a powerful compression technology, binarization can
reduce model weights to as little as 1 bit, sharply lowering computation and
memory costs. However, existing quantization techniques
fall short of maintaining LLM performance under ultra-low bit-widths. In
response to this challenge, we present BiLLM, a groundbreaking 1-bit
post-training quantization scheme tailored for pretrained LLMs. Based on the
weight distribution of LLMs, BiLLM first identifies and structurally selects
salient weights, and minimizes the compression loss through an effective binary
residual approximation strategy. Moreover, considering the bell-shaped
distribution of the non-salient weights, we propose an optimal splitting search
to group and binarize them accurately. For the first time, BiLLM achieves
high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only
1.08-bit weights across various LLM families and evaluation metrics,
outperforming SOTA LLM quantization methods by significant margins.
Furthermore, BiLLM can binarize an LLM with 7 billion weights within 0.5 hours
on a single GPU, demonstrating satisfactory time efficiency.
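To make the two ideas in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation: `residual_binarize` mimics the binary residual approximation applied to salient weights, and `split_binarize` mimics the splitting search over the bell-shaped non-salient weights. All function names are hypothetical; plain magnitude ranking stands in for the paper's Hessian-guided structural selection of salient weights, and a simple grid search stands in for its optimal break-point search.

```python
import numpy as np

def binarize(w):
    # Optimal 1-bit approximation alpha * sign(w); the scale
    # alpha = mean(|w|) minimizes the squared error for a fixed sign grid.
    alpha = np.abs(w).mean() if w.size else 0.0
    return alpha * np.sign(w)

def residual_binarize(w):
    # Binary residual approximation for salient weights: binarize once,
    # then binarize the leftover residual and sum the two binary parts.
    first = binarize(w)
    return first + binarize(w - first)

def split_binarize(w, num_candidates=64):
    # Search a break point p that splits the bell-shaped non-salient
    # weights into a concentrated group (|w| <= p) and a sparse group
    # (|w| > p), binarizing each group with its own scale.
    if w.size == 0:
        return w.copy()
    best_err, best = np.inf, None
    for p in np.linspace(1e-8, np.abs(w).max(), num_candidates):
        mask = np.abs(w) <= p
        approx = np.empty_like(w)
        approx[mask] = binarize(w[mask])
        approx[~mask] = binarize(w[~mask])
        err = np.sum((w - approx) ** 2)
        if err < best_err:
            best_err, best = err, approx
    return best

def billm_style_quantize(W, salient_frac=0.05):
    # End-to-end sketch: treat the largest-magnitude weights as salient
    # (a stand-in for BiLLM's structural, Hessian-based selection), apply
    # the residual approximation to them, and the split search to the rest.
    flat = np.abs(W).ravel()
    k = max(1, int(salient_frac * flat.size))
    threshold = np.sort(flat)[-k]
    salient = np.abs(W) >= threshold
    out = np.empty_like(W)
    out[salient] = residual_binarize(W[salient])
    out[~salient] = split_binarize(W[~salient])
    return out

W = np.random.randn(128, 128).astype(np.float32)
W_bin = billm_style_quantize(W)
print("relative error:", np.linalg.norm(W - W_bin) / np.linalg.norm(W))
```

The extra binary residual carried by the small salient group is what pushes the average storage cost slightly above 1 bit per weight, consistent with the 1.08-bit figure quoted in the abstract.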