
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

August 25, 2023
Authors: Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo
cs.AI

Abstract

Large language models (LLMs) have revolutionized natural language processing tasks. However, their practical deployment is hindered by their immense memory and computation requirements. Although recent post-training quantization (PTQ) methods are effective in reducing the memory footprint and improving the computational efficiency of LLMs, they hand-craft quantization parameters, which leads to low performance and fails to handle extremely low-bit quantization. To tackle this issue, we introduce an Omnidirectionally calibrated Quantization (OmniQuant) technique for LLMs, which achieves good performance in diverse quantization settings while maintaining the computational efficiency of PTQ by efficiently optimizing various quantization parameters. OmniQuant comprises two innovative components: Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). LWC modulates the extreme values of weights by optimizing the clipping threshold. Meanwhile, LET tackles activation outliers by shifting the challenge of quantization from activations to weights through a learnable equivalent transformation. Operating within a differentiable framework using block-wise error minimization, OmniQuant can efficiently optimize the quantization process for both weight-only and weight-activation quantization. For instance, the LLaMA-2 model family, with sizes ranging from 7B to 70B, can be processed with OmniQuant on a single A100-40G GPU within 1-16 hours using 128 samples. Extensive experiments validate OmniQuant's superior performance across diverse quantization configurations such as W4A4, W6A6, W4A16, W3A16, and W2A16. Additionally, OmniQuant demonstrates effectiveness in instruction-tuned models and delivers notable improvements in inference speed and memory reduction on real devices. Codes and models are available at https://github.com/OpenGVLab/OmniQuant.
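To make the abstract's two components concrete, below is a minimal PyTorch sketch of the general idea, not the authors' implementation (see the linked repository for that). All names here are illustrative assumptions: the class QuantLinearLWCLET, the parameters gamma, beta, and let_scale, the helper calibrate_block, and the chosen bit-widths, initialization, optimizer, and learning rate. The LET shown uses only a per-channel scale, whereas the paper also learns shifts and applies the transformation across attention and feed-forward layers.

```python
import torch
import torch.nn as nn


def round_ste(x):
    """Round with a straight-through estimator so gradients pass through to x."""
    return x + (x.round() - x).detach()


def fake_quantize(x, n_bits, x_min, x_max):
    """Uniform asymmetric fake quantization into the learned range [x_min, x_max]."""
    qmax = 2 ** n_bits - 1
    scale = (x_max - x_min).clamp(min=1e-5) / qmax
    zero_point = round_ste(-x_min / scale)
    q = torch.clamp(round_ste(x / scale) + zero_point, 0, qmax)
    return (q - zero_point) * scale


class QuantLinearLWCLET(nn.Module):
    """Hypothetical nn.Linear wrapper with learnable weight clipping (LWC) and a
    learnable equivalent transformation (LET). Weights stay frozen, as in PTQ;
    only the clipping factors and the channel-wise LET scale are trained."""

    def __init__(self, linear: nn.Linear, w_bits=4, a_bits=4):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)
        self.bias = linear.bias
        self.w_bits, self.a_bits = w_bits, a_bits
        out_f, in_f = self.weight.shape
        # LWC: per-output-channel clipping factors; sigmoid(4.0) ~ 0.98,
        # so optimization starts close to "no clipping".
        self.gamma = nn.Parameter(torch.full((out_f, 1), 4.0))
        self.beta = nn.Parameter(torch.full((out_f, 1), 4.0))
        # LET: per-input-channel scale that migrates activation outliers into weights.
        self.let_scale = nn.Parameter(torch.ones(in_f))

    def forward(self, x):
        # LET: (x / s) @ (s * W)^T == x @ W^T, so the product is unchanged in full
        # precision, but the rescaled activations become easier to quantize.
        s = self.let_scale.clamp(min=1e-5)
        x = x / s
        w = self.weight * s

        # LWC: learned factors shrink the per-channel range before quantization.
        w_max = torch.sigmoid(self.gamma) * w.amax(dim=1, keepdim=True)
        w_min = torch.sigmoid(self.beta) * w.amin(dim=1, keepdim=True)
        w_q = fake_quantize(w, self.w_bits, w_min, w_max)

        # Per-token activation quantization (weight-activation setting, e.g. W4A4).
        x_q = fake_quantize(x, self.a_bits,
                            x.amin(dim=-1, keepdim=True),
                            x.amax(dim=-1, keepdim=True))
        return nn.functional.linear(x_q, w_q, self.bias)


def calibrate_block(fp_block, quant_block, calib_inputs, steps=20, lr=1e-2):
    """Block-wise error minimization (sketch): train only the LWC/LET parameters
    so the quantized block matches the full-precision block on calibration data."""
    names = ("gamma", "beta", "let_scale")
    params = [p for n, p in quant_block.named_parameters() if n.split(".")[-1] in names]
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        for x in calib_inputs:  # hidden states entering this transformer block
            with torch.no_grad():
                target = fp_block(x)
            loss = nn.functional.mse_loss(quant_block(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return quant_block
```

Because only a handful of scalars per channel are optimized per block, a loop like this touches each transformer block once with a small calibration set, which is consistent with the abstract's claim of calibrating LLaMA-2 7B-70B on a single A100-40G GPU with 128 samples.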