Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing

Abstract

Transformer models have been widely adopted across many domains in recent years, and large language models in particular have advanced the field of AI significantly. The scale of these networks has greatly increased their capabilities, but at the cost of a significant increase in the compute they require. Quantization is one of the most effective ways to reduce the computation time and memory consumption of neural networks. Many studies have shown, however, that modern transformer models tend to learn strong outliers in their activations, which makes them difficult to quantize. To retain acceptable performance, these outliers force activations to be kept at a higher bitwidth, or require different numeric formats, extra fine-tuning, or other workarounds. We show that strong outliers are related to a very specific behavior of attention heads that try to learn a "no-op" or only a partial update of the residual. To achieve the exact zeros needed in the attention matrix for a no-update, the input to the softmax is pushed to be larger and larger during training, causing outliers in other parts of the network. Based on these observations, we propose two simple (independent) modifications to the attention mechanism: clipped softmax and gated attention. We empirically show that models pre-trained using our methods learn significantly smaller outliers while maintaining, and sometimes even improving, the floating-point task performance. This enables us to quantize transformers to full INT8 quantization of the activations without any additional effort. We demonstrate the effectiveness of our methods on both language models (BERT, OPT) and vision transformers.
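
To make the "exact zeros" argument concrete, the following is a minimal NumPy sketch of a clipped softmax. It assumes the stretch-and-clip form clip((zeta - gamma) * softmax(x) + gamma, 0, 1); this formula, the parameter names `gamma` and `zeta`, and the value `gamma = -0.03` are illustrative assumptions, since the abstract itself does not give a definition. The sketch shows only that a standard softmax assigns strictly positive probability to every position at moderate logit values, whereas a clipped variant can reach exact zeros without the logits having to grow.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def clipped_softmax(x, gamma=-0.03, zeta=1.0, axis=-1):
    """Stretch softmax outputs from [0, 1] to [gamma, zeta], then clip to [0, 1].

    gamma <= 0 and zeta >= 1 are hypothetical hyperparameters chosen for
    illustration; they are not taken from the abstract.
    """
    probs = softmax(x, axis=axis)
    return np.clip((zeta - gamma) * probs + gamma, 0.0, 1.0)


# One attention row with modest logits: one position clearly preferred.
logits = np.array([4.0, 0.0, 0.0, 0.0])

plain = softmax(logits)
clipped = clipped_softmax(logits)

print("softmax:        ", plain)    # every entry is strictly positive
print("clipped softmax:", clipped)  # the three small entries are exactly 0.0
```

With a plain softmax, driving the small entries toward zero requires an ever larger gap between logits, which is the pressure the abstract identifies as the source of activation outliers. In the clipped version, any probability below roughly `-gamma / (zeta - gamma)` is mapped to exactly zero, so a finite and small logit gap is enough.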
