TEQ: Trainable Equivalent Transformation for Quantization of LLMs
October 17, 2023
Authors: Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
cs.AI
Abstract
As large language models (LLMs) become more prevalent, there is a growing
need for new and improved quantization methods that can meet the
computational demands of these modern architectures while maintaining
accuracy. In this paper, we present TEQ, a trainable equivalent
transformation that preserves the FP32 precision of the model output while
taking advantage of low-precision quantization, especially 3 and 4 bits
weight-only quantization. The training process is lightweight, requiring only
1K steps and fewer than 0.1 percent of the original model's trainable
parameters. Furthermore, the transformation does not add any computational
overhead during inference. Our results are on par with the state-of-the-art
(SOTA) methods on typical LLMs. Our approach can be combined with other methods
to achieve even better performance. The code is available at
https://github.com/intel/neural-compressor.
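
To illustrate the equivalence idea described in the abstract, the PyTorch sketch below shows how a per-channel scale can be trained for a single linear layer while leaving the FP32 output mathematically unchanged, so that only the quantization error depends on the scale. This is a minimal sketch under assumed details, not the authors' implementation (the actual TEQ code lives in the neural-compressor repository linked above); the names rtn_quantize, EquivalentlyScaledLinear, and log_s are hypothetical, and the round-to-nearest quantizer with a straight-through estimator stands in for whatever weight-only quantization scheme is used in practice.

import torch
import torch.nn as nn
import torch.nn.functional as F

def rtn_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    # Symmetric round-to-nearest weight-only quantization, per output channel,
    # with a straight-through estimator so gradients reach the trainable scales.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (q - w).detach()

class EquivalentlyScaledLinear(nn.Module):
    # Wraps a frozen nn.Linear with one trainable scale per input channel.
    # Scaling the weights by s and the activations by 1/s leaves the FP32
    # output unchanged; only the quantization error is affected by s.
    def __init__(self, linear: nn.Linear, bits: int = 4):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad_(False)              # original weights stay frozen
        self.bits = bits
        # One trainable scale per input channel: far below 0.1% of the layer's parameters.
        self.log_s = nn.Parameter(torch.zeros(linear.in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s = self.log_s.exp()                     # keep scales strictly positive
        w_eq = self.linear.weight * s            # scale each input channel of W
        w_q = rtn_quantize(w_eq, self.bits)      # simulate 3- or 4-bit weight quantization
        return F.linear(x / s, w_q, self.linear.bias)

In such a sketch, the scales would be trained for a small number of steps (the paper reports about 1K) to minimize the gap between the quantized output and the frozen FP32 output, and at export time the 1/s factor can be folded into the preceding operation (for example a LayerNorm or an earlier linear layer), which is why the transformation adds no computational overhead during inference.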