TEQ: Trainable Equivalent Transformation for Quantization of LLMs
October 17, 2023
Authors: Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
cs.AI
Abstract
As large language models (LLMs) become more prevalent, there is a growing need for new and improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. In this paper, we present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization, especially 3- and 4-bit weight-only quantization. The training process is lightweight, requiring only 1K steps and fewer than 0.1 percent of the original model's trainable parameters. Furthermore, the transformation does not add any computational overhead during inference. Our results are on par with the state-of-the-art (SOTA) methods on typical LLMs. Our approach can be combined with other methods to achieve even better performance. The code is available at https://github.com/intel/neural-compressor.
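
To make the idea concrete, the following minimal PyTorch sketch illustrates the equivalence-preserving per-channel scaling that such a transformation relies on: scaling a linear layer's weight input channels by a vector s while inversely scaling its activations leaves the FP32 output unchanged, and the quantizer then acts on the rescaled weights. This is an illustrative sketch under those assumptions, not the paper's implementation or the neural-compressor API; `quantize_weight` is a hypothetical round-to-nearest quantizer, and in TEQ the scales are trained rather than fixed as here.

```python
import torch

def quantize_weight(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Hypothetical per-output-channel symmetric round-to-nearest quantizer
    (illustrative only; not the neural-compressor API)."""
    q_max = 2 ** (n_bits - 1) - 1                               # e.g. 7 for 4 bits
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / q_max
    return (w / scale).round().clamp(-q_max - 1, q_max) * scale

torch.manual_seed(0)
x = torch.randn(2, 8)    # toy activations
w = torch.randn(16, 8)   # toy FP32 weight of a linear layer, y = x @ w.T

# Per-input-channel scales; in TEQ these would be trainable parameters,
# here they are fixed random values for illustration.
s = torch.rand(8) + 0.5

# Equivalent transformation: (x / s) @ (w * s).T == x @ w.T in FP32.
y_ref = x @ w.t()
y_eq = (x / s) @ (w * s).t()
assert torch.allclose(y_ref, y_eq, atol=1e-5)

# Quantization is applied to the *rescaled* weights, so training s to
# minimize the output error below reshapes the weight distribution to be
# easier to quantize without changing the full-precision function.
y_q = (x / s) @ quantize_weight(w * s).t()
print("mean abs error:", (y_q - y_ref).abs().mean().item())
```

Because the inverse activation scaling can be folded into an adjacent preceding operation after training, such a transformation adds no extra computation at inference time, consistent with the claim in the abstract.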