BiLLM: Pushing the Limit of Post-Training Quantization for LLMs

Abstract

Pretrained large language models (LLMs) exhibit exceptional general language processing capabilities but place significant demands on memory and computational resources. As a powerful compression technique, binarization can reduce model weights to as little as 1 bit, lowering the expensive computation and memory requirements. However, existing quantization techniques fall short of maintaining LLM performance at ultra-low bit-widths. In response to this challenge, we present BiLLM, a groundbreaking 1-bit post-training quantization scheme tailored for pretrained LLMs. Based on the weight distribution of LLMs, BiLLM first identifies and structurally selects salient weights, and minimizes the compression loss through an effective binary residual approximation strategy. Moreover, considering the bell-shaped distribution of the non-salient weights, we propose an optimal splitting search to group and binarize them accurately. BiLLM achieves, for the first time, high-accuracy inference (e.g., 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLM families and evaluation metrics, outperforming SOTA LLM quantization methods by significant margins. Moreover, BiLLM can binarize an LLM with 7 billion weights within 0.5 hours on a single GPU, demonstrating satisfactory time efficiency.
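To make the idea of binary residual approximation concrete, the following is a minimal sketch of generic sign-based binarization with a second pass applied to the residual. It is an illustration under stated assumptions, not the BiLLM implementation: the function names (`binarize`, `residual_binarize`, `squared_error`) and the toy weight vector are hypothetical, the scale is taken as the mean absolute value (the least-squares optimum for a fixed sign pattern), and the salient-weight selection and splitting search described in the abstract are not modeled.

```python
# Illustrative sketch only: generic 1-bit binarization plus one residual pass.
# All names and the toy data below are assumptions, not taken from BiLLM.

def binarize(w):
    """Approximate w by alpha * signs, with signs in {-1, +1}.

    For the sign pattern sign(w), alpha = mean(|w|) minimizes the
    squared error ||w - alpha * signs||^2.
    """
    signs = [1.0 if x >= 0 else -1.0 for x in w]
    alpha = sum(abs(x) for x in w) / len(w)
    return alpha, signs


def residual_binarize(w):
    """Binarize w, then binarize the leftover error and add both terms."""
    a1, b1 = binarize(w)
    residual = [x - a1 * s for x, s in zip(w, b1)]
    a2, b2 = binarize(residual)
    return [a1 * s1 + a2 * s2 for s1, s2 in zip(b1, b2)]


def squared_error(w, approx):
    """Sum of squared differences between the weights and their approximation."""
    return sum((x - y) ** 2 for x, y in zip(w, approx))


# Toy weights with a few large-magnitude entries among small ones.
weights = [0.9, -0.1, 0.05, -1.2, 0.3, -0.02]

alpha, signs = binarize(weights)
one_pass = [alpha * s for s in signs]
two_pass = residual_binarize(weights)

print("one-pass error:", squared_error(weights, one_pass))
print("two-pass error:", squared_error(weights, two_pass))
```

The second pass costs one extra bit and one extra scale per weight it is applied to, which is why a scheme of this kind would plausibly reserve it for a small set of salient weights while the rest stay at a single bit.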
