BitNet a4.8: 4-bit Activations for 1-bit LLMs
November 7, 2024
Authors: Hongyu Wang, Shuming Ma, Furu Wei
cs.AI
Abstract
Recent research on 1-bit Large Language Models (LLMs), such as BitNet
b1.58, presents a promising direction for reducing the inference cost of LLMs
while maintaining their performance. In this work, we introduce BitNet a4.8,
enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid
quantization and sparsification strategy to mitigate the quantization errors
introduced by outlier channels. Specifically, we use 4-bit activations for the
inputs to the attention and feed-forward network layers, while sparsifying
intermediate states followed by 8-bit quantization. Extensive experiments
demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58
at equivalent training cost, while being faster at inference by enabling
4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of its
parameters and supports a 3-bit KV cache, further enhancing the efficiency of
large-scale LLM deployment and inference.
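
To make the hybrid scheme concrete, below is a minimal PyTorch sketch of the two activation paths the abstract describes: per-token 4-bit absmax quantization for layer inputs, and magnitude-based sparsification followed by 8-bit quantization for intermediate states. The function names, the per-token absmax quantizer, and the keep_ratio knob are illustrative assumptions for this sketch, not the authors' implementation; the 55% figure in the abstract refers to activated parameters, not necessarily to this exact sparsity ratio.

    import torch

    def quant_activation_int4(x: torch.Tensor):
        # Per-token absmax quantization of activations to INT4 ([-8, 7]).
        # Illustrative sketch; BitNet a4.8's exact quantizer may differ.
        scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7.0
        x_q = (x / scale).round().clamp(-8, 7)
        return x_q, scale

    def sparsify_then_quant_int8(x: torch.Tensor, keep_ratio: float = 0.55):
        # Zero out all but the largest-magnitude entries per token, then
        # quantize the surviving values to INT8. keep_ratio is a hypothetical
        # knob for illustration only.
        k = max(1, int(keep_ratio * x.shape[-1]))
        thresh = x.abs().topk(k, dim=-1).values[..., -1:]  # k-th largest magnitude per token
        mask = x.abs() >= thresh
        x_sparse = x * mask
        scale = x_sparse.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
        x_q = (x_sparse / scale).round().clamp(-128, 127)
        return x_q, scale, mask

    # Usage: 4-bit quantize the input to an attention/FFN layer, and
    # sparsify + 8-bit quantize an intermediate state.
    x_in = torch.randn(2, 16, 1024)   # (batch, seq, hidden)
    x4, s4 = quant_activation_int4(x_in)
    h_mid = torch.randn(2, 16, 4096)  # e.g., FFN intermediate state
    h8, s8, mask = sparsify_then_quant_int8(h_mid)

In this reading, the low-bit path feeds INT4/FP4 kernels for the dense layer inputs, while the sparsified intermediate states tolerate 8-bit quantization because the outlier-heavy, small-magnitude entries have been dropped.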