BitNet a4.8: 4-bit Activations for 1-bit LLMs
November 7, 2024
Authors: Hongyu Wang, Shuming Ma, Furu Wei
cs.AI
Abstract
Recent research on 1-bit Large Language Models (LLMs), such as BitNet
b1.58, presents a promising direction for reducing the inference cost of LLMs
while maintaining their performance. In this work, we introduce BitNet a4.8,
enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid
quantization and sparsification strategy to mitigate the quantization errors
introduced by outlier channels. Specifically, we use 4-bit activations for the
inputs to the attention and feed-forward network layers, while sparsifying
intermediate states followed by 8-bit quantization. Extensive experiments
demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58
at equivalent training cost, while being faster at inference by enabling
4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of its
parameters and supports a 3-bit KV cache, further enhancing the efficiency of
large-scale LLM deployment and inference.
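
To make the hybrid scheme concrete, below is a minimal PyTorch sketch of the two activation paths the abstract describes: per-token 4-bit absmax quantization for layer inputs, and magnitude-based sparsification followed by 8-bit quantization for intermediate states. The function names, the per-token absmax quantizer, and the keep_ratio knob are illustrative assumptions for this sketch, not the authors' implementation; the 55% figure in the abstract refers to activated parameters, not necessarily to this exact sparsity ratio.

    import torch

    def quant_activation_int4(x: torch.Tensor):
        # Per-token absmax quantization of activations to INT4 ([-8, 7]).
        # Illustrative sketch; BitNet a4.8's exact quantizer may differ.
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7.0
        x_q = (x / scale).round().clamp(-8, 7)
        return x_q, scale

    def sparsify_then_quant_int8(x: torch.Tensor, keep_ratio: float = 0.55):
        # Zero out all but the largest-magnitude entries per token, then
        # quantize the surviving values to INT8. keep_ratio is a hypothetical
        # knob for illustration only.
        k = max(1, int(keep_ratio * x.shape[-1]))
        thresh = x.abs().topk(k, dim=-1).values[..., -1:]  # k-th largest magnitude per token
        mask = x.abs() >= thresh
        x_sparse = x * mask
        scale = x_sparse.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
        x_q = (x_sparse / scale).round().clamp(-128, 127)
        return x_q, scale, mask

    # Usage: 4-bit quantize the input to an attention/FFN layer, and
    # sparsify + 8-bit quantize an intermediate state.
    x_in = torch.randn(2, 16, 1024)   # (batch, seq, hidden)
    x4, s4 = quant_activation_int4(x_in)
    h_mid = torch.randn(2, 16, 4096)  # e.g., FFN intermediate state
    h8, s8, mask = sparsify_then_quant_int8(h_mid)

In this reading, the low-bit path feeds INT4/FP4 kernels for the dense layer inputs, while the sparsified intermediate states tolerate 8-bit quantization because the outlier-heavy, small-magnitude entries have been dropped.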