

BitNet a4.8: 4-bit Activations for 1-bit LLMs

November 7, 2024
Authors: Hongyu Wang, Shuming Ma, Furu Wei
cs.AI

Abstract

Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by outlier channels. Specifically, we use 4-bit activations for the inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 at equivalent training cost, while delivering faster inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of its parameters and supports a 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.
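To make the hybrid scheme concrete, below is a minimal, illustrative PyTorch sketch of the two operations the abstract describes: low-bit quantization of the inputs to the attention and feed-forward sub-layers, and sparsification of intermediate states followed by 8-bit quantization. This is not the authors' implementation; the per-token absmax scaling, the function names, and the `keep_ratio` value are assumptions made only for illustration.

```python
import torch

def quant_act_int4(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Illustrative per-token absmax quantization of activations to 4 bits.
    Values are mapped to the signed INT4 range [-8, 7], then dequantized
    back to float for simulation purposes."""
    scale = 7.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    return (x * scale).round().clamp(-8, 7) / scale

def sparsify_then_quant_int8(x: torch.Tensor, keep_ratio: float = 0.55,
                             eps: float = 1e-5) -> torch.Tensor:
    """Illustrative sparsification + 8-bit quantization of an intermediate state:
    keep only the largest-magnitude entries per token (keep_ratio is an assumed
    value), zero out the rest, then apply absmax quantization to INT8."""
    k = max(1, int(keep_ratio * x.shape[-1]))
    thresh = x.abs().topk(k, dim=-1).values[..., -1:]   # per-token magnitude cutoff
    x_sparse = x * (x.abs() >= thresh)
    scale = 127.0 / x_sparse.abs().max(dim=-1, keepdim=True).values.clamp(min=eps)
    return (x_sparse * scale).round().clamp(-128, 127) / scale

# Toy usage: quantize a sub-layer input to 4 bits, and sparsify + 8-bit
# quantize a (hypothetical) FFN intermediate state.
x_in = torch.randn(2, 16, 64)            # (batch, tokens, hidden)
x_in_q4 = quant_act_int4(x_in)
h_mid = torch.randn(2, 16, 256)          # intermediate state
h_mid_q8 = sparsify_then_quant_int8(h_mid)
```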