EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
July 10, 2024
Authors: Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo
cs.AI
Abstract
Large language models (LLMs) are integral to modern natural language
processing and artificial intelligence. However, they face challenges in
managing their significant memory requirements. Although quantization-aware
training (QAT) offers a solution by reducing memory consumption through low-bit
representations with minimal accuracy loss, it demands substantial training
resources to optimize model weights and quantization parameters. To address
this, we propose Efficient Quantization-Aware Training (EfficientQAT), a novel
quantization technique for compressing LLMs. EfficientQAT involves two
consecutive phases: Block-wise training of all parameters (Block-AP) and
end-to-end training of quantization parameters (E2E-QP). Block-AP sequentially
conducts quantization-aware training for all parameters in each transformer
block with block-wise reconstruction, maintaining efficiency by avoiding
training the entire LLM. Initialized with the quantized model, E2E-QP then trains
only quantization parameters (step sizes) end-to-end, enhancing efficiency with
a fixed quantized backbone and reduced trainable parameter count. Extensive
experiments demonstrate that EfficientQAT outperforms previous quantization
methods across a range of models, including base LLMs, instruction-tuned LLMs,
and multimodal LLMs, with scales from 7B to 70B parameters at various
quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model
on a single A100-80GB GPU in 41 hours, with less than 3% accuracy degradation
compared to the full precision (69.48 vs. 72.41). Notably, this INT2 quantized
70B model obtains a 1.67 accuracy gain over the Llama-2-13B model (69.48 vs.
67.81) while requiring less memory (19.2GB vs. 24.2GB). Code is available at
https://github.com/OpenGVLab/EfficientQAT.
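
The abstract describes the two phases only at a high level. Below is a minimal, illustrative sketch of how Block-AP and E2E-QP could be organized in PyTorch. It assumes a Hugging Face-style causal LM whose forward returns a `.loss`, transformer blocks that map a hidden-state tensor to a tensor, and weights whose element count is divisible by the quantization group size; `FakeQuantLinear`, `quantize_linears`, `block_ap`, and `e2e_qp` are hypothetical names, not the authors' implementation (see the linked repository for the real code).

```python
# Minimal sketch of the two EfficientQAT phases (hypothetical helper names,
# not the authors' implementation).
import copy
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    """Linear layer with group-wise uniform fake quantization of its weight.

    The per-group step size (scale) and zero-point are learnable quantization
    parameters; a straight-through estimator lets gradients reach the latent
    full-precision weight during Block-AP.
    """
    def __init__(self, linear: nn.Linear, n_bits: int = 2, group_size: int = 64):
        super().__init__()
        # Assumes weight.numel() is divisible by group_size.
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bias = linear.bias
        self.n_bits, self.group_size = n_bits, group_size
        w = self.weight.detach().view(-1, group_size)
        rng = (w.max(dim=1).values - w.min(dim=1).values).clamp(min=1e-8)
        self.scale = nn.Parameter(rng / (2 ** n_bits - 1))          # step size
        self.zero = nn.Parameter(-w.min(dim=1).values / self.scale.detach())

    def quantized_weight(self) -> torch.Tensor:
        w = self.weight.view(-1, self.group_size)
        s, z = self.scale.unsqueeze(1), self.zero.unsqueeze(1)
        x = w / s + z
        q = torch.clamp(torch.round(x), 0, 2 ** self.n_bits - 1)
        q = x + (q - x).detach()                 # straight-through estimator
        return ((q - z) * s).view_as(self.weight)

    def forward(self, x):
        return nn.functional.linear(x, self.quantized_weight(), self.bias)

def quantize_linears(block: nn.Module, n_bits: int) -> nn.Module:
    """Replace every nn.Linear in a block with a FakeQuantLinear."""
    for name, m in list(block.named_modules()):
        if isinstance(m, nn.Linear):
            parent = block.get_submodule(name.rsplit(".", 1)[0]) if "." in name else block
            setattr(parent, name.rsplit(".", 1)[-1], FakeQuantLinear(m, n_bits))
    return block

def block_ap(fp_block: nn.Module, calib_inputs, n_bits: int = 2, steps: int = 100):
    """Phase 1 (Block-AP): train ALL parameters of one transformer block
    (latent weights + scales/zero-points) to reconstruct the full-precision
    block's outputs on calibration data; blocks are handled one at a time."""
    q_block = quantize_linears(copy.deepcopy(fp_block), n_bits)
    opt = torch.optim.AdamW(q_block.parameters(), lr=1e-4)
    for _ in range(steps):
        for x in calib_inputs:                   # x: hidden-state tensors
            with torch.no_grad():
                target = fp_block(x)
            loss = nn.functional.mse_loss(q_block(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_block

def e2e_qp(q_model: nn.Module, dataloader, steps: int = 1000):
    """Phase 2 (E2E-QP): keep the quantized backbone fixed (frozen latent
    weights stand in for the fixed integer weights here) and train only the
    quantization step sizes end-to-end on the task loss."""
    for p in q_model.parameters():
        p.requires_grad = False
    scales = [m.scale for m in q_model.modules() if isinstance(m, FakeQuantLinear)]
    for s in scales:
        s.requires_grad = True
    opt = torch.optim.AdamW(scales, lr=2e-5)
    data = iter(dataloader)
    for _ in range(steps):
        batch = next(data)
        loss = q_model(**batch).loss             # HF-style causal-LM loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_model
```

Handling one block at a time in Block-AP bounds memory to a single transformer block rather than the whole model, and E2E-QP updates only the per-group step sizes, so the trainable parameter count (and optimizer state) stays small, which is consistent with the single-GPU 70B result reported in the abstract.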