Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Abstract

Transformer models have been widely adopted across many domains in recent years, and large language models in particular have advanced the field of AI significantly. Due to their size, the capability of these networks has increased tremendously, but at the cost of a significant increase in required compute. Quantization is one of the most effective ways to reduce the computation time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize. To retain acceptable performance, these outliers require activations to be kept at a higher bitwidth, or call for different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to a very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to become larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple, independent modifications to the attention mechanism: clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining, and sometimes even improving, the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
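
The abstract names the clipped softmax but does not give its formula. The following is a minimal sketch of the idea, assuming the form clip((zeta - gamma) * softmax(x) + gamma, 0, 1) with gamma <= 0 and zeta >= 1. The function names and the default values of `gamma` and `zeta` are illustrative assumptions, not values taken from the text. It shows why a plain softmax can only approach zero as its input range grows, while a stretched and clipped variant can reach exact zeros with a modest input range.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def clipped_softmax(xs, gamma=-0.03, zeta=1.0):
    """Stretch softmax outputs to [gamma, zeta], then clip to [0, 1].

    Assumed form: clip((zeta - gamma) * softmax(x) + gamma, 0, 1).
    gamma and zeta are hypothetical hyperparameters chosen for illustration.
    """
    return [min(1.0, max(0.0, (zeta - gamma) * p + gamma)) for p in softmax(xs)]


# One dominant logit with a gap of only 4 to the others.
logits = [4.0, 0.0, 0.0, 0.0]
plain = softmax(logits)            # small but strictly positive tail values
clipped = clipped_softmax(logits)  # tail values are clipped to exactly 0.0

print("plain:  ", [round(p, 4) for p in plain])
print("clipped:", [round(p, 4) for p in clipped])
```

With a plain softmax, driving the tail probabilities toward zero requires an ever larger gap between logits, which is the pressure the abstract links to activation outliers. In this sketch the clipped variant yields exact zeros at a fixed, small gap. Note that the clipped outputs no longer sum to one.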
