Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Abstract

With the increasing complexity of generative AI models, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile devices and TVs. Existing PTQ schemes, however, consume considerable time and resources, which can become a bottleneck in practice, where models are updated frequently and hyperparameters must be tuned repeatedly. One-shot PTQ schemes have been proposed as a cost-effective alternative, but their performance is somewhat limited because they cannot account for the inter-layer dependency within the attention module, a key feature of Transformers. In this paper, we therefore propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-wise for efficiency while considering cross-layer dependency to preserve the attention score. Through extensive experiments on various language models and a complexity analysis, we demonstrate that aespa is accurate and efficient in quantizing Transformer models.
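The abstract does not state aespa's objective function, so the following is only an illustrative sketch, not the authors' algorithm. It shows the distinction the abstract draws: a layer-wise criterion measures the error of one projection's output, while a cross-layer criterion measures the error in the attention score, which also depends on the key projection. The quantizer here is plain round-to-nearest, and all names (`quantize_rtn`, `layer_error`, `attention_score_error`), shapes, and the 4-bit setting are assumptions chosen for the example.

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Uniform asymmetric round-to-nearest quantization, one scale per row.

    This is a generic baseline quantizer, assumed here for illustration only.
    """
    qmax = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.clip(np.round((w - w_min) / scale), 0, qmax)
    return q * scale + w_min  # dequantized weights

def layer_error(x, w, w_q):
    """Layer-wise criterion: squared error of a single projection's output."""
    return np.sum((x @ w.T - x @ w_q.T) ** 2)

def attention_score_error(x, w_query, w_query_q, w_key):
    """Cross-layer criterion: squared error of the unscaled, pre-softmax score Q K^T."""
    k = x @ w_key.T
    score = (x @ w_query.T) @ k.T
    score_q = (x @ w_query_q.T) @ k.T
    return np.sum((score - score_q) ** 2)

# Hypothetical sizes: 16 tokens, hidden dimension 32, random data.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
w_query = rng.standard_normal((32, 32))
w_key = rng.standard_normal((32, 32))

# Quantize only the query projection and compare the two criteria.
w_query_q = quantize_rtn(w_query, bits=4)
print("layer-wise output error:", layer_error(x, w_query, w_query_q))
print("attention score error:  ", attention_score_error(x, w_query, w_query_q, w_key))
```

The two numbers are not directly comparable in magnitude. The point is that the second one changes with `w_key` even though only `w_query` was quantized, which is the kind of dependency a purely layer-wise criterion cannot see.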
