Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
June 22, 2023
Authors: Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort
cs.AI
Abstract
Transformer models have been widely adopted in various domains over the last few years, and large language models in particular have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but this has come at the cost of a significant increase
in necessary compute. Quantization is one of the most effective ways to reduce
the computational time and memory consumption of neural networks. Many studies
have shown, however, that modern transformer models tend to learn strong
outliers in their activations, making them difficult to quantize. To retain
acceptable performance, the existence of these outliers requires activations to
be in higher bitwidth or the use of different numeric formats, extra
fine-tuning, or other workarounds. We show that strong outliers are related to
very specific behavior of attention heads that try to learn a "no-op" or just a
partial update of the residual. To achieve the exact zeros needed in the
attention matrix for a no-update, the input to the softmax is pushed to be
larger and larger during training, causing outliers in other parts of the
network. Based on these observations, we propose two simple (independent)
modifications to the attention mechanism - clipped softmax and gated attention.
We empirically show that models pre-trained using our methods learn
significantly smaller outliers while maintaining and sometimes even improving
the floating-point task performance. This enables full INT8 quantization of the activations in transformers without any additional effort. We
demonstrate the effectiveness of our methods on both language models (BERT,
OPT) and vision transformers.
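
To make the two proposed modifications concrete, below is a minimal PyTorch sketch of a clipped softmax and a gated attention block. The stretch parameters `zeta`/`gamma`, the linear sigmoid gate, and the module names are illustrative assumptions based on the abstract's description, not the authors' released implementation.

```python
# Minimal sketch of the two attention modifications described above.
# The stretch parameters zeta/gamma and the linear sigmoid gate are
# illustrative assumptions, not the authors' exact parameterization.
import torch
import torch.nn.functional as F


def clipped_softmax(logits: torch.Tensor,
                    zeta: float = 1.003, gamma: float = -0.003) -> torch.Tensor:
    """Stretch softmax outputs to (gamma, zeta) and clip back to [0, 1].

    Because the stretched range extends slightly past [0, 1], a head can
    produce exactly-zero (or exactly-one) attention probabilities with finite
    logits, so it no longer needs extreme softmax inputs to "do nothing".
    """
    probs = F.softmax(logits, dim=-1)
    return torch.clamp((zeta - gamma) * probs + gamma, min=0.0, max=1.0)


class GatedAttention(torch.nn.Module):
    """Self-attention whose update is scaled by a learned sigmoid gate.

    The gate lets a head write nothing (gate near 0) or only a partial
    update to the residual, without requiring exact zeros in the softmax.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = torch.nn.Linear(d_model, d_model)  # hypothetical gating function

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)
        return torch.sigmoid(self.gate(x)) * attn_out  # element-wise gated update


if __name__ == "__main__":
    # Small attention probabilities are pushed below zero by the stretch
    # and then clipped, yielding exact zeros from finite logits.
    demo_logits = torch.tensor([[8.0, 0.0, 0.0, 0.0]])
    print(clipped_softmax(demo_logits))        # ~[1.0, 0.0, 0.0, 0.0]

    x = torch.randn(2, 16, 64)                 # (batch, sequence, d_model)
    print(GatedAttention(64, 8)(x).shape)      # torch.Size([2, 16, 64])
```

In the setup the abstract describes, either replacement would be applied inside every attention head during pre-training, which is when the outlier-inducing "no-op" behavior develops; the two mechanisms are independent and need not be combined.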