Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Abstract

Transformer models have been widely adopted in various domains in recent years, and large language models in particular have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase in necessary compute. Quantization is one of the most effective ways to reduce the computational time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, the existence of these outliers requires activations to be kept in a higher bitwidth, or requires different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism: clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining and sometimes even improving the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
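To make the "exact zeros" argument concrete, here is a minimal sketch in plain Python. The abstract does not give the clipped softmax formula, so the form used here (stretch the softmax output from (0, 1) to (gamma, zeta), then clip to [0, 1]) is an assumption based on the usual description of the method, and the parameter names zeta and gamma and their values are purely illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def clipped_softmax(scores, zeta=1.0, gamma=-0.03):
    """Assumed form: clip((zeta - gamma) * softmax(x) + gamma, 0, 1).

    zeta >= 1 and gamma <= 0 are hyperparameters; the defaults are
    illustrative, not values taken from the source.
    """
    stretched = [(zeta - gamma) * p + gamma for p in softmax(scores)]
    return [min(1.0, max(0.0, s)) for s in stretched]

# One position dominates by a moderate margin of 4 in the logits.
scores = [4.0, 0.0, 0.0, 0.0]

plain = softmax(scores)            # every entry stays strictly positive
clipped = clipped_softmax(scores)  # small entries become exactly 0.0

print("softmax:        ", [round(p, 4) for p in plain])
print("clipped softmax:", [round(p, 4) for p in clipped])
```

With the plain softmax, the three small probabilities can only approach zero as the logit gap grows, which is the pressure toward ever larger softmax inputs that the abstract describes. Under the assumed clipped form, the same moderate gap already yields exact zeros, so the head has no incentive to keep enlarging its inputs.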
