Training Transformers with 4-bit Integers

Abstract

Quantizing activations, weights, and gradients to 4 bits is a promising way to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats that contemporary hardware does not support. In this work, we propose a training method for transformers in which all matrix multiplications are implemented with INT4 arithmetic. Training at such ultra-low precision is challenging. To achieve it, we carefully analyze the specific structures of activations and gradients in transformers and propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress them. For backpropagation, we exploit the structural sparsity of gradients, proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks, including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. Our prototype linear operator implementation is up to 2.2 times faster than its FP16 counterparts and speeds up training by up to 35.1%.
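To make the outlier problem in forward propagation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single activation vector, symmetric round-to-nearest quantization to integer codes in [-7, 7], and an orthonormal Walsh-Hadamard transform applied before quantization. The function names (hadamard_transform, quantize_int4, dequantize, mse) and the toy input are hypothetical and chosen only for illustration. The abstract does not specify how the actual quantizer handles scaling, blocking, or matrix shapes.

```python
import math

def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform its own inverse
    return [v * scale for v in y]

def quantize_int4(x):
    """Assumed scheme: symmetric per-vector quantization to codes in [-7, 7]."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in x]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)

# Toy activation vector: moderate values plus one large outlier (hypothetical data).
x = [0.5, -0.4, 0.3, -0.5, 0.4, -0.3, 0.5, 8.0]

# Direct quantization: the outlier sets the step size, so the other entries round to 0.
codes_direct, scale_direct = quantize_int4(x)
x_direct = dequantize(codes_direct, scale_direct)

# Hadamard route: transform, quantize, dequantize, transform back.
x_rotated = hadamard_transform(x)
codes_had, scale_had = quantize_int4(x_rotated)
x_had = hadamard_transform(dequantize(codes_had, scale_had))

err_direct = mse(x, x_direct)
err_hadamard = mse(x, x_had)
print("direct MSE:  ", err_direct)
print("Hadamard MSE:", err_hadamard)
```

On this toy input the transform spreads the outlier's magnitude across all entries, so the quantization step shrinks and the reconstruction error drops relative to direct quantization. The size of the gain depends on the data; an input without outliers may see little or no benefit.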
