Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers

Abstract

As generative AI models grow more complex, post-training quantization (PTQ) has emerged as a promising solution for deploying hyper-scale models on edge devices such as mobile devices and TVs. Existing PTQ schemes, however, consume considerable time and resources, which can become a bottleneck in practice when frequent model updates and repeated hyper-parameter tuning are required. One-shot PTQ schemes have been proposed as a cost-effective alternative. Their performance is still somewhat limited, however, because they cannot account for the inter-layer dependency within the attention module, which is a key feature of Transformers. In this paper, we therefore propose a novel PTQ algorithm that balances accuracy and efficiency. The key idea of the proposed algorithm, called aespa, is to perform quantization layer-wise for efficiency while considering cross-layer dependency to preserve the attention score. Through extensive experiments on various language models and a complexity analysis, we demonstrate that aespa is both accurate and efficient in quantizing Transformer models.
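
The contrast drawn in the abstract, between a purely layer-wise objective and one that preserves the attention score, can be illustrated with a small NumPy sketch. This is not the aespa algorithm itself, whose details are not given here. It only compares the two error measures on random data, using plain round-to-nearest quantization as a stand-in quantizer. The function names, tensor shapes, and bit width are all assumptions made for illustration.

```python
import numpy as np


def quantize_rtn(w, bits=4):
    """Round-to-nearest uniform quantization, one scale per output row.

    Stand-in quantizer for illustration only (an assumption, not aespa).
    Returns the dequantized weights so they can be compared with `w`.
    """
    qmax = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale


def layerwise_error(w, w_q, x):
    """Squared Frobenius error of one layer's output: ||W x - W_q x||^2."""
    return float(np.linalg.norm((w - w_q) @ x) ** 2)


def attention_score_error(w_query, w_query_q, w_key, w_key_q, x):
    """Squared error of the pre-softmax attention scores Q^T K / sqrt(d).

    Unlike the layer-wise error, this couples the query and key
    projections, which is the cross-layer dependency the abstract names.
    """
    d_head = w_query.shape[0]
    scores = (w_query @ x).T @ (w_key @ x) / np.sqrt(d_head)
    scores_q = (w_query_q @ x).T @ (w_key_q @ x) / np.sqrt(d_head)
    return float(np.linalg.norm(scores - scores_q) ** 2)


# Hypothetical sizes: 16-dim model, 8-dim head, 32 calibration tokens.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))
w_query = rng.standard_normal((8, 16))
w_key = rng.standard_normal((8, 16))
w_query_q = quantize_rtn(w_query)
w_key_q = quantize_rtn(w_key)

print("layer-wise error (query):", layerwise_error(w_query, w_query_q, x))
print("layer-wise error (key):  ", layerwise_error(w_key, w_key_q, x))
print("attention-score error:   ",
      attention_score_error(w_query, w_query_q, w_key, w_key_q, x))
```

The two layer-wise errors are computed independently per projection, whereas the attention-score error depends on how the query and key quantization errors interact. A quantizer tuned only for the former has no direct control over the latter.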
