TEQ: Trainable Equivalent Transformation for Quantization of LLMs
October 17, 2023
Authors: Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
cs.AI
Abstract
As large language models (LLMs) become more prevalent, there is a growing
need for new and improved quantization methods that can meet the
computational demands of these modern architectures while maintaining
accuracy. In this paper, we present TEQ, a trainable equivalent
transformation that preserves the FP32 precision of the model output while
taking advantage of low-precision quantization, especially 3 and 4 bits
weight-only quantization. The training process is lightweight, requiring only
1K steps and fewer than 0.1 percent of the original model's trainable
parameters. Furthermore, the transformation does not add any computational
overhead during inference. Our results are on par with the state-of-the-art
(SOTA) methods on typical LLMs. Our approach can be combined with other methods
to achieve even better performance. The code is available at
https://github.com/intel/neural-compressor.
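
To illustrate the equivalence idea described in the abstract, the PyTorch sketch below shows how a per-channel scale can be trained for a single linear layer while leaving the FP32 output mathematically unchanged, so that only the quantization error depends on the scale. This is a minimal sketch under assumed details, not the authors' implementation (the actual TEQ code lives in the neural-compressor repository linked above); the names rtn_quantize, EquivalentlyScaledLinear, and log_s are hypothetical, and the round-to-nearest quantizer with a straight-through estimator stands in for whatever weight-only quantization scheme is used in practice.

import torch
import torch.nn as nn
import torch.nn.functional as F

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric round-to-nearest weight-only quantization, per output channel,
    # with a straight-through estimator so gradients reach the trainable scales.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()

class EquivalentlyScaledLinear(nn.Module):
    # Wraps a frozen nn.Linear with one trainable scale per input channel.
    # Scaling the weights by s and the activations by 1/s leaves the FP32
    # output unchanged; only the quantization error is affected by s.
    def __init__(self, linear: nn.Linear, bits: int = 4):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)              # original weights stay frozen
        self.bits = bits
        # One trainable scale per input channel: far below 0.1% of the layer's parameters.
        self.log_s = nn.Parameter(torch.zeros(linear.in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.log_s.exp()                     # keep scales strictly positive
        w_eq = self.linear.weight * s            # scale each input channel of W
        w_q = rtn_quantize(w_eq, self.bits)      # simulate 3- or 4-bit weight quantization
        return F.linear(x / s, w_q, self.linear.bias)

In such a sketch, the scales would be trained for a small number of steps (the paper reports about 1K) to minimize the gap between the quantized output and the frozen FP32 output, and at export time the 1/s factor can be folded into the preceding operation (for example a LayerNorm or an earlier linear layer), which is why the transformation adds no computational overhead during inference.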