TEQ: Trainable Equivalent Transformation for Quantization of LLMs
October 17, 2023
Authors: Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
cs.AI
Abstract
As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. In this paper, we present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization, especially 3- and 4-bit weight-only quantization. The training process is lightweight, requiring only 1K steps and fewer than 0.1 percent of the original model's trainable parameters. Furthermore, the transformation does not add any computational overhead during inference. Our results are on par with the state-of-the-art (SOTA) methods on typical LLMs. Our approach can be combined with other methods to achieve even better performance. The code is available at https://github.com/intel/neural-compressor.
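
To make the idea concrete, the following minimal PyTorch sketch illustrates the equivalence-preserving per-channel scaling that such a transformation relies on: scaling a linear layer's weight input channels by a vector s while inversely scaling its activations leaves the FP32 output unchanged, and the quantizer then acts on the rescaled weights. This is an illustrative sketch under those assumptions, not the paper's implementation or the neural-compressor API; `quantize_weight` is a hypothetical round-to-nearest quantizer, and in TEQ the scales are trained rather than fixed as here.

```python
import torch

def quantize_weight(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Hypothetical per-output-channel symmetric round-to-nearest quantizer
    (illustrative only; not the neural-compressor API)."""
    q_max = 2 ** (n_bits - 1) - 1                               # e.g. 7 for 4 bits
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / q_max
    return (w / scale).round().clamp(-q_max - 1, q_max) * scale

torch.manual_seed(0)
x = torch.randn(2, 8)    # toy activations
w = torch.randn(16, 8)   # toy FP32 weight of a linear layer, y = x @ w.T

# Per-input-channel scales; in TEQ these would be trainable parameters,
# here they are fixed random values for illustration.
s = torch.rand(8) + 0.5

# Equivalent transformation: (x / s) @ (w * s).T == x @ w.T in FP32.
y_ref = x @ w.t()
y_eq = (x / s) @ (w * s).t()
assert torch.allclose(y_ref, y_eq, atol=1e-5)

# Quantization is applied to the *rescaled* weights, so training s to
# minimize the output error below reshapes the weight distribution to be
# easier to quantize without changing the full-precision function.
y_q = (x / s) @ quantize_weight(w * s).t()
print("mean abs error:", (y_q - y_ref).abs().mean().item())
```

Because the inverse activation scaling can be folded into an adjacent preceding operation after training, such a transformation adds no extra computation at inference time, consistent with the claim in the abstract.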