EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

July 10, 2024
Authors: Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo
cs.AI

Abstract

Large language models (LLMs) are integral to modern natural language processing and artificial intelligence. However, they face challenges in managing their significant memory requirements. Although quantization-aware training (QAT) offers a solution by reducing memory consumption through low-bit representations with minimal accuracy loss, it demands substantial training resources to optimize model weights and quantization parameters. To address this, we propose Efficient Quantization-Aware Training (EfficientQAT), a novel quantization technique for compressing LLMs. EfficientQAT involves two consecutive phases: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP). Block-AP sequentially conducts quantization-aware training for all parameters in each transformer block with block-wise reconstruction, maintaining efficiency by avoiding training the entire LLM. Initialized with the quantized model, E2E-QP then trains only the quantization parameters (step sizes) end-to-end, enhancing efficiency with a fixed quantized backbone and a reduced trainable parameter count. Extensive experiments demonstrate that EfficientQAT outperforms previous quantization methods across a range of models, including base LLMs, instruction-tuned LLMs, and multimodal LLMs, with scales from 7B to 70B parameters at various quantization bits. For instance, EfficientQAT obtains a 2-bit Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than 3% accuracy degradation compared to full precision (69.48 vs. 72.41). Notably, this INT2 quantized 70B model obtains a 1.67 accuracy gain over the Llama-2-13B model (69.48 vs. 67.81) while requiring less memory (19.2GB vs. 24.2GB). Code is available at https://github.com/OpenGVLab/EfficientQAT.
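
To make the two-phase procedure concrete, below is a minimal PyTorch sketch of the idea described in the abstract. It is an illustrative toy, not the released implementation: the uniform per-output-channel fake quantization with a learnable step size, the block-reconstruction loop, and the end-to-end loop over step sizes are simplifying assumptions, and the names QuantLinear, block_ap, and e2e_qp are hypothetical.

```python
# Toy two-phase sketch of EfficientQAT (Block-AP, then E2E-QP).
# Assumptions: each transformer block is a plain nn.Module mapping a tensor to
# a tensor, and weights use uniform per-output-channel fake quantization with a
# learnable step size. Class and function names here are hypothetical.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class QuantLinear(nn.Module):
    """Linear layer with fake-quantized weights and a learnable step size."""

    def __init__(self, linear: nn.Linear, n_bits: int = 2):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach().clone())
        self.bias = nn.Parameter(linear.bias.detach().clone()) if linear.bias is not None else None
        self.qmin = -(2 ** (n_bits - 1))
        self.qmax = 2 ** (n_bits - 1) - 1
        # Step size initialized from the per-channel weight range.
        self.scale = nn.Parameter(self.weight.abs().amax(dim=1, keepdim=True) / self.qmax)

    def fake_quant(self, w):
        # Straight-through estimator on the rounding only, so gradients reach
        # both the weights and the step size (scale).
        q = torch.clamp(w / self.scale, self.qmin, self.qmax)
        q = q + (q.round() - q).detach()
        return q * self.scale

    def forward(self, x):
        return F.linear(x, self.fake_quant(self.weight), self.bias)


def swap_linears(module: nn.Module, n_bits: int = 2):
    """Recursively replace nn.Linear layers with QuantLinear."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, QuantLinear(child, n_bits))
        else:
            swap_linears(child, n_bits)


def block_ap(fp_block: nn.Module, calib_inputs, n_bits=2, epochs=2, lr=1e-4):
    """Phase 1 (Block-AP): quantization-aware training of ALL parameters of a
    single block, supervised by the full-precision block's outputs."""
    q_block = copy.deepcopy(fp_block)
    swap_linears(q_block, n_bits)
    with torch.no_grad():
        targets = [fp_block(x) for x in calib_inputs]
    opt = torch.optim.AdamW(q_block.parameters(), lr=lr)
    for _ in range(epochs):
        for x, t in zip(calib_inputs, targets):
            loss = F.mse_loss(q_block(x), t)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return q_block


def e2e_qp(q_model: nn.Module, data_loader, loss_fn, lr=1e-5):
    """Phase 2 (E2E-QP): freeze the quantized backbone and train only the
    quantization step sizes end-to-end on the task loss."""
    for p in q_model.parameters():
        p.requires_grad_(False)
    scales = [m.scale for m in q_model.modules() if isinstance(m, QuantLinear)]
    for s in scales:
        s.requires_grad_(True)
    opt = torch.optim.AdamW(scales, lr=lr)
    for inputs, labels in data_loader:
        loss = loss_fn(q_model(inputs), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return q_model
```

The point the sketch tries to capture is the split in trainable state: Block-AP updates both weights and step sizes, but only within one block at a time against that block's full-precision outputs, while E2E-QP touches the whole model but only through the step sizes. Note that this toy keeps fake quantization throughout, whereas the actual pipeline stores true low-bit weights after Block-AP, which is what yields the memory savings reported in the abstract.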
