TEQ: Trainable Equivalent Transformation for Quantization of LLMs

Abstract

As large language models (LLMs) become more prevalent, there is a growing need for quantization methods that meet the computational demands of these architectures while maintaining accuracy. In this paper, we present TEQ, a trainable equivalent transformation that preserves the FP32 precision of the model output while taking advantage of low-precision quantization, especially 3-bit and 4-bit weight-only quantization. The training process is lightweight, requiring only 1K steps and fewer than 0.1% of the original model's trainable parameters. Furthermore, the transformation adds no computational overhead during inference. Our results are on par with state-of-the-art (SOTA) methods on typical LLMs, and our approach can be combined with other methods to achieve even better performance. The code is available at https://github.com/intel/neural-compressor.
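The abstract does not spell out the form of the transformation, so the following is only a minimal sketch of what an "equivalent transformation" can look like, assuming a per-input-channel scaling vector `s` (an assumption made here for illustration, not a detail stated above). Dividing the activations by `s` and multiplying the matching weight rows by `s` leaves the full-precision output unchanged, while the scaled weights may quantize differently under round-to-nearest weight-only quantization. All function names and the values of `x`, `w`, and `s` are hypothetical; in a trained setting `s` would be learned rather than chosen by hand.

```python
def matvec(x, w):
    """Compute y = x @ w for a vector x (n_in) and a matrix w (n_in x n_out)."""
    n_out = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(n_out)]


def apply_equivalent_scaling(x, w, s):
    """Divide each activation by s[i] and multiply weight row i by s[i]."""
    x_scaled = [xi / si for xi, si in zip(x, s)]
    w_scaled = [[wij * si for wij in row] for row, si in zip(w, s)]
    return x_scaled, w_scaled


def quantize_rtn(w, bits=4):
    """Symmetric round-to-nearest quantization with one step size per
    output column. Returns the dequantized (float) weights."""
    qmax = 2 ** (bits - 1) - 1
    n_in, n_out = len(w), len(w[0])
    out = [[0.0] * n_out for _ in range(n_in)]
    for j in range(n_out):
        max_abs = max(abs(w[i][j]) for i in range(n_in))
        step = max_abs / qmax if max_abs > 0 else 1.0
        for i in range(n_in):
            q = max(-qmax - 1, min(qmax, round(w[i][j] / step)))
            out[i][j] = q * step
    return out


def max_abs_error(a, b):
    """Largest absolute element-wise difference between two vectors."""
    return max(abs(ai - bi) for ai, bi in zip(a, b))


# Hypothetical toy data: one large activation paired with small weights.
x = [0.5, -8.0, 0.25]
w = [[0.9, -0.4], [0.01, 0.02], [-0.7, 0.3]]
s = [1.0, 4.0, 1.0]  # hand-picked for illustration; would be trained in practice

y_ref = matvec(x, w)
x_eq, w_eq = apply_equivalent_scaling(x, w, s)
y_eq = matvec(x_eq, w_eq)

# In full precision the transformed layer reproduces the original output.
print("FP difference:", max_abs_error(y_ref, y_eq))

# After 4-bit weight-only quantization the two variants generally differ.
y_q_plain = matvec(x, quantize_rtn(w, bits=4))
y_q_scaled = matvec(x_eq, quantize_rtn(w_eq, bits=4))
print("error without scaling:", max_abs_error(y_ref, y_q_plain))
print("error with scaling:   ", max_abs_error(y_ref, y_q_scaled))
```

Whether a given `s` reduces the quantization error depends on the data, which is why such scales would be trained. A scaling of this kind can typically be folded into a preceding layer so that no extra operation remains at inference; that folding is not shown in this sketch.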
