
BitNet a4.8: 4-bit Activations for 1-bit LLMs

November 7, 2024
Authors: Hongyu Wang, Shuming Ma, Furu Wei
cs.AI

Abstract

Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by outlier channels. Specifically, we use 4-bit activations for the inputs to the attention and feed-forward network layers, while sparsifying the intermediate states and then quantizing them to 8 bits. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 at an equivalent training cost, while being faster at inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of its parameters and supports a 3-bit KV cache, further improving the efficiency of large-scale LLM deployment and inference.
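
Below is a minimal PyTorch sketch of the hybrid scheme the abstract describes: 4-bit quantization for activations entering the attention and feed-forward layers, and sparsification followed by 8-bit quantization for intermediate states. The per-token absmax scaling, the top-k selection rule, the keep ratio, and the function names are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import torch

def quantize_act_int4(x: torch.Tensor) -> torch.Tensor:
    """Fake-quantize activations to 4 bits with per-token absmax scaling
    (assumed scheme), e.g. for inputs to attention and FFN layers."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7.0
    return (x / scale).round().clamp(-8, 7) * scale

def sparsify_then_int8(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out all but the largest-magnitude entries per token, then
    fake-quantize the survivors to 8 bits (assumed top-k sparsification)."""
    k = max(1, int(keep_ratio * x.shape[-1]))
    thresh = x.abs().topk(k, dim=-1).values[..., -1:]
    x = torch.where(x.abs() >= thresh, x, torch.zeros_like(x))
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    return (x / scale).round().clamp(-128, 127) * scale

# Example usage: 4-bit quantize the layer input, sparsify + 8-bit quantize
# an intermediate state (shapes are illustrative).
hidden = torch.randn(2, 16, 512)              # (batch, seq, model dim)
attn_ffn_input = quantize_act_int4(hidden)
intermediate = sparsify_then_int8(torch.randn(2, 16, 2048))
```

The point of the split is that outlier channels make low-bit quantization of intermediate states lossy; dropping small-magnitude entries first and keeping 8 bits for the rest sidesteps that, while the better-behaved layer inputs can tolerate 4 bits.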