Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers
February 14, 2024
Authors: Junhan Kim, Kyungphil Park, Chungman Lee, Ho-young Kim, Joonyoung Kim, Yongkweon Jeon
cs.AI
Abstract
With the increasing complexity of generative AI models, post-training
quantization (PTQ) has emerged as a promising solution for deploying
hyper-scale models on edge devices such as mobile devices and TVs. Existing PTQ
schemes, however, consume considerable time and resources, which could be a
bottleneck in real situations where frequent model updates and multiple
hyper-parameter tunings are required. As a cost-effective alternative, one-shot
PTQ schemes have been proposed. Still, the performance is somewhat limited
because they cannot consider the inter-layer dependency within the attention
module, which is a very important feature of Transformers. In this paper, we
thus propose a novel PTQ algorithm that balances accuracy and efficiency. The
key idea of the proposed algorithm called aespa is to perform quantization
layer-wise for efficiency while considering cross-layer dependency to preserve
the attention score. Through extensive experiments on various language models
and complexity analysis, we demonstrate that aespa is accurate and efficient in
quantizing Transformer models.
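To make the key idea concrete, the sketch below contrasts a purely local, per-layer reconstruction objective with one that evaluates the quantized query projection through the attention score, so the error term still reflects the coupling between the query and key layers. This is a minimal illustration under assumed details, not the paper's actual aespa objective or quantizer: the helper names (quantize_weight, attention_score), the round-to-nearest quantizer, the bit-width, and the toy calibration data are all hypothetical.

```python
import torch

def quantize_weight(w, n_bits=4):
    # Simple per-channel round-to-nearest uniform quantization; a stand-in
    # for whatever weight quantizer a PTQ scheme would actually use.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def attention_score(x, w_q, w_k):
    # Softmax-normalized attention score for a single head.
    q, k = x @ w_q.T, x @ w_k.T
    return torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)

# Toy calibration batch and full-precision query/key projections.
torch.manual_seed(0)
x = torch.randn(8, 16, 64)          # (batch, tokens, hidden)
w_q = torch.randn(64, 64) / 8
w_k = torch.randn(64, 64) / 8

target = attention_score(x, w_q, w_k)

# Layer-wise view: quantize W_Q alone, but measure the error on the attention
# score, so the objective still accounts for the dependency on W_K.
w_q_hat = quantize_weight(w_q)
score_err = torch.mean((attention_score(x, w_q_hat, w_k) - target) ** 2)

# Contrast: a purely local objective that ignores the cross-layer coupling.
local_err = torch.mean((x @ w_q_hat.T - x @ w_q.T) ** 2)

print(f"attention-score reconstruction error: {score_err.item():.6f}")
print(f"per-layer output reconstruction error: {local_err.item():.6f}")
```

The point of the contrast is that each weight matrix can still be quantized one layer at a time (cheap, one-shot style), while the loss being minimized is defined on the attention score rather than on the isolated layer output.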