Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Abstract

With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile devices and TVs. Existing PTQ schemes, however, consume considerable time and resources, which can become a bottleneck in practice, where models are updated frequently and hyper-parameters must be tuned repeatedly. One-shot PTQ schemes have been proposed as a cost-effective alternative, but their performance is limited because they cannot account for the inter-layer dependency within the attention module, a key feature of Transformers. In this paper, we therefore propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-wise for efficiency while considering cross-layer dependency to preserve the attention score. Through extensive experiments on various language models and a complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models.
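The distinction between a purely layer-wise objective and one that preserves the attention score can be illustrated with a small NumPy sketch. This is not the aespa algorithm. It only contrasts two reconstruction errors under assumptions introduced here for illustration: a symmetric per-tensor round-to-nearest quantizer (`quantize_rtn`), random query/key projection weights, and the pre-softmax scaled dot product as the "attention score". All function names are hypothetical.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Symmetric per-tensor round-to-nearest quantization (illustrative assumption)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax  # assumes w is not all zeros
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def layerwise_error(x, w, w_hat):
    """Error of one projection viewed in isolation: ||X W - X W_hat||_F^2."""
    return float(np.sum((x @ w - x @ w_hat) ** 2))

def attention_score_error(x, wq, wk, wq_hat, wk_hat):
    """Error of the pre-softmax scores, which couples the query and key layers."""
    d = wq.shape[1]
    s = (x @ wq) @ (x @ wk).T / np.sqrt(d)
    s_hat = (x @ wq_hat) @ (x @ wk_hat).T / np.sqrt(d)
    return float(np.sum((s - s_hat) ** 2))

# Toy data: 8 tokens, hidden size 16 (arbitrary values chosen for the sketch).
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))
wq = rng.standard_normal((16, 16))
wk = rng.standard_normal((16, 16))
wq_hat = quantize_rtn(wq)
wk_hat = quantize_rtn(wk)

print("query layer-wise error:", layerwise_error(x, wq, wq_hat))
print("key layer-wise error:  ", layerwise_error(x, wk, wk_hat))
print("attention score error: ", attention_score_error(x, wq, wk, wq_hat, wk_hat))
```

The two layer-wise errors treat each projection independently, whereas the score error depends on the query and key weights jointly. That joint dependence is the kind of cross-layer effect the abstract says one-shot schemes cannot consider.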