

Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

June 22, 2023
Authors: Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort
cs.AI

Abstract

Transformer models have been widely adopted in various domains over the last years, and especially large language models have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, the existence of these outliers requires activations to be in higher bitwidth or the use of different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
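The abstract names the two modifications, clipped softmax and gated attention, but does not spell out their exact form. The following PyTorch sketch is an illustration only: the stretch bounds `gamma` and `zeta`, the per-token sigmoid gate, and all module names are assumptions made for the example, not the authors' precise definitions. The two pieces are shown together but are meant as independent alternatives, matching the abstract's description.

```python
# Illustrative sketch only: one plausible realization of "clipped softmax" and
# "gated attention" as described at a high level in the abstract. Parameter
# values and the gating network are assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F


def clipped_softmax(logits: torch.Tensor, gamma: float = -0.03, zeta: float = 1.0) -> torch.Tensor:
    """Stretch softmax outputs to [gamma, zeta], then clip back to [0, 1].

    With gamma < 0, an attention weight can become exactly 0 without the
    softmax inputs having to grow unboundedly, which is the outlier-producing
    behavior the abstract describes.
    """
    probs = F.softmax(logits, dim=-1)
    return torch.clamp((zeta - gamma) * probs + gamma, min=0.0, max=1.0)


class GatedAttention(torch.nn.Module):
    """Single-head self-attention with a sigmoid gate on the head output (illustrative)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.qkv = torch.nn.Linear(d_model, 3 * d_model)
        self.gate = torch.nn.Linear(d_model, d_model)  # per-token, per-channel gate (assumed form)
        self.out = torch.nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        attn = F.softmax(scores, dim=-1)
        # The gate lets a head suppress or scale down its residual update
        # (output near zero) without extreme attention logits.
        y = torch.sigmoid(self.gate(x)) * (attn @ v)
        return self.out(y)
```

As a quick shape check, `GatedAttention(64)(torch.randn(2, 16, 64))` returns a tensor of the same `(2, 16, 64)` shape, and `clipped_softmax` can be dropped in wherever a plain softmax over attention scores would be used.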