BitNet a4.8: 4-bit Activations for 1-bit LLMs

Abstract

Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, which enables 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by outlier channels. Specifically, we use 4-bit activations for the inputs to the attention and feed-forward network layers, while the intermediate states are sparsified and then quantized to 8 bits. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 at equivalent training cost, while being faster at inference because it enables 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of parameters and supports a 3-bit KV cache, further improving the efficiency of large-scale LLM deployment and inference.
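The hybrid strategy described above can be illustrated with a minimal NumPy sketch. The abstract does not specify the quantizers or the sparsification rule, so everything here is an assumption made for illustration: symmetric per-row absmax scaling, magnitude-based top-k sparsification, and a hypothetical `keep_ratio` of 0.5. The function names are invented for this example and are not taken from the BitNet a4.8 code.

```python
import numpy as np

def quantize_absmax(x, bits):
    """Symmetric per-row quantization (assumed scheme): returns (dequantized, integer codes)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4 bits, 127 for 8 bits
    scale = np.max(np.abs(x), axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)        # guard against all-zero rows
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale, q.astype(np.int8)

def topk_sparsify(x, keep_ratio):
    """Keep the largest-magnitude entries in each row and zero out the rest (assumed rule)."""
    k = max(1, int(round(keep_ratio * x.shape[-1])))
    idx = np.argpartition(np.abs(x), -k, axis=-1)[..., -k:]   # indices of the k largest |x|
    mask = np.zeros(x.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return np.where(mask, x, 0.0)

def sparsify_then_quantize(x, keep_ratio=0.5, bits=8):
    """Intermediate-state path: sparsify first, then quantize to 8 bits."""
    return quantize_absmax(topk_sparsify(x, keep_ratio), bits)

rng = np.random.default_rng(0)
layer_input = rng.normal(size=(4, 16))     # stand-in for inputs to attention / FFN
intermediate = rng.normal(size=(4, 16))    # stand-in for an intermediate state

deq4, q4 = quantize_absmax(layer_input, bits=4)       # 4-bit path for layer inputs
deq8, q8 = sparsify_then_quantize(intermediate)       # sparse 8-bit path
```

In this sketch the 4-bit codes lie in the signed range [-8, 7], and at most half of the entries in each row of `q8` are nonzero. Sparsifying before quantizing means the 8-bit grid is spent only on the entries that survive.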
