EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Abstract

Large language models (LLMs) are integral to modern natural language processing and artificial intelligence, but their large memory requirements make them hard to deploy. Quantization-aware training (QAT) addresses this by reducing memory consumption through low-bit representations with minimal accuracy loss, yet it demands substantial training resources to optimize both model weights and quantization parameters. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a novel quantization technique for compressing LLMs. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). Block-AP sequentially conducts quantization-aware training for all parameters in each transformer block with block-wise reconstruction, maintaining efficiency by avoiding training the entire LLM at once. Initialized with the resulting quantized model, E2E-QP then trains only the quantization parameters (step sizes) end-to-end, which improves efficiency because the quantized backbone stays fixed and the number of trainable parameters is small. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, at scales from 7B to 70B parameters and at various quantization bit-widths. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3 points of accuracy degradation compared to the full-precision model (69.48 vs. 72.41). Notably, this INT2-quantized 70B model obtains a 1.67-point accuracy gain over the Llama-2-13B model (69.48 vs. 67.81) while requiring less memory (19.2GB vs. 24.2GB). Code is available at https://github.com/OpenGVLab/EfficientQAT.
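
The idea behind E2E-QP, keeping the integer weights frozen and training only the step size, can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses plain Python, a single linear unit with one shared step size, and a simple min-max rounding in place of Block-AP to obtain the initial quantized weights. All function names (`init_min_max`, `quantize`, `dequantize`, `mse`, `train_step_size`) and the data are hypothetical, and the uniform quantizer with a step size and zero point is an assumption about the quantization scheme.

```python
def init_min_max(weights, bits):
    """Min-max initialization of step size and zero point (assumption of this sketch)."""
    qmax = 2 ** bits - 1
    step = (max(weights) - min(weights)) / qmax
    zero_point = round(-min(weights) / step)
    return step, zero_point


def quantize(weights, step, zero_point, bits):
    """Map real weights to integer codes in [0, 2**bits - 1]."""
    qmax = 2 ** bits - 1
    return [min(max(round(w / step) + zero_point, 0), qmax) for w in weights]


def dequantize(q_weights, step, zero_point):
    """Recover approximate real weights from the integer codes."""
    return [(q - zero_point) * step for q in q_weights]


def mse(q_weights, step, zero_point, inputs, targets):
    """Mean squared error of the quantized linear unit against the targets."""
    w_hat = dequantize(q_weights, step, zero_point)
    total = 0.0
    for x, t in zip(inputs, targets):
        pred = sum(w * xi for w, xi in zip(w_hat, x))
        total += (pred - t) ** 2
    return total / len(inputs)


def train_step_size(q_weights, step, zero_point, inputs, targets, lr=0.05, epochs=200):
    """Gradient descent on the step size only; the integer codes stay frozen."""
    for _ in range(epochs):
        grad = 0.0
        for x, t in zip(inputs, targets):
            # The prediction is step * basis, so d(pred)/d(step) = basis.
            basis = sum((q - zero_point) * xi for q, xi in zip(q_weights, x))
            grad += 2.0 * (step * basis - t) * basis
        step -= lr * grad / len(inputs)
    return step


# Toy data: targets come from the full-precision weights.
weights = [0.31, -0.42, 0.07, 0.55]
inputs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
targets = [sum(w * xi for w, xi in zip(weights, x)) for x in inputs]

step0, zp = init_min_max(weights, bits=2)
q = quantize(weights, step0, zp, bits=2)           # frozen 2-bit codes
step1 = train_step_size(q, step0, zp, inputs, targets)

print("codes:", q)
print("step size before/after:", step0, step1)
print("MSE before/after:", mse(q, step0, zp, inputs, targets),
      mse(q, step1, zp, inputs, targets))
```

In this toy setting only one scalar is updated while the 2-bit codes never change, which mirrors the abstract's point that E2E-QP trains few parameters on top of a fixed quantized backbone. A real model would have many step sizes and would be trained with an automatic differentiation framework rather than a hand-written gradient.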
