QuEST: Stable Training of LLMs with 1-Bit Weights and Activations

Abstract

One approach to reducing the massive costs of large language models (LLMs) is to use quantized or sparse representations for training or deployment. While post-training compression methods are very popular, the question of whether even more accurate compressed models can be obtained by training directly over such representations, i.e., Quantization-Aware Training (QAT), is still open: for example, a recent study (arXiv:2411.04330v2) put the "optimal" bit-width at which models can be trained using QAT, while staying accuracy-competitive with standard FP16/BF16 precision, at 8-bit weights and activations. We advance the state of the art with a new method called QuEST, which is Pareto-competitive with FP16, i.e., it provides better accuracy at a smaller model size, while training models with weights and activations in 4 bits or fewer. Moreover, QuEST allows stable training with 1-bit weights and activations. QuEST achieves this by improving two key aspects of QAT methods: (1) accurate and fast quantization of the (continuous) distributions of weights and activations via Hadamard normalization and MSE-optimal fitting; (2) a new trust gradient estimator based on the idea of explicitly minimizing the error between the noisy gradient computed over quantized states and the "true" (but unknown) full-precision gradient. Experiments on Llama-type architectures show that QuEST induces stable scaling laws across the entire range of hardware-supported precisions, and that it can be extended to sparse representations. We provide GPU kernel support showing that models produced by QuEST can be executed efficiently. Our code is available at https://github.com/IST-DASLab/QuEST.
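
To make the two ingredients named in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes an orthonormal Walsh-Hadamard transform as the normalization step, a brute-force grid search as a stand-in for MSE-optimal fitting, and a simple mask that "trusts" only coordinates with small quantization error as one possible reading of the trust estimator. All function names, the `num_candidates` parameter, and the `scale / 2` threshold are hypothetical choices made for illustration; the abstract does not specify them.

```python
import math


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in y]


def quantize(x, scale, bits):
    """Round onto a symmetric uniform grid (k + 0.5) * scale with 2**bits levels.

    For bits=1 the grid is {-scale/2, +scale/2}, i.e. a scaled sign function.
    """
    half = (2 ** bits) // 2
    out = []
    for v in x:
        k = math.floor(v / scale)
        k = max(-half, min(half - 1, k))  # clip to the representable range
        out.append((k + 0.5) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def fit_scale(x, bits, num_candidates=200):
    """Assumed fitting procedure: grid-search the step size with the lowest MSE."""
    half = (2 ** bits) // 2
    max_abs = max(abs(v) for v in x) or 1.0
    best_scale, best_err = None, float("inf")
    for i in range(1, num_candidates + 1):
        scale = (2.0 * max_abs / half) * i / num_candidates
        err = mse(x, quantize(x, scale, bits))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale


def trust_mask(x, xq, threshold):
    """Hypothetical trust rule: 1.0 where quantization error is small, else 0.0."""
    return [1.0 if abs(a - b) <= threshold + 1e-12 else 0.0 for a, b in zip(x, xq)]


def quantize_with_hadamard(x, bits):
    """Transform, fit a scale, quantize, and compute the trust mask."""
    xh = hadamard_transform(x)
    scale = fit_scale(xh, bits)
    xq = quantize(xh, scale, bits)
    # In a backward pass, a gradient with respect to xh could be multiplied
    # elementwise by this mask (an assumption of this sketch).
    mask = trust_mask(xh, xq, scale / 2)
    return xq, scale, mask
```

Calling `quantize_with_hadamard(values, bits=1)` on a list whose length is a power of two returns the quantized coordinates in the Hadamard domain, the fitted step size, and the mask. Because the normalized transform is its own inverse, applying `hadamard_transform` to the quantized output maps it back to the original basis.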
