Training Transformers with 4-bit Integers
June 21, 2023
Authors: Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu
cs.AI
Abstract
Quantizing activations, weights, and gradients to 4 bits is a promising way to
accelerate neural network training. However, existing 4-bit training methods
require custom numerical formats which are not supported by contemporary
hardware. In this work, we propose a training method for transformers with all
matrix multiplications implemented with INT4 arithmetic. Training with such
ultra-low INT4 precision is challenging. To make this possible, we carefully analyze
the specific structures of activations and gradients in transformers and propose
dedicated quantizers for them. For forward propagation, we identify the
challenge of outliers and propose a Hadamard quantizer to suppress the
outliers. For backpropagation, we exploit the structural sparsity of gradients,
proposing bit splitting and leverage score sampling techniques to quantize
gradients accurately. Our algorithm achieves competitive accuracy on a wide
range of tasks including natural language understanding, machine translation,
and image classification. Unlike previous 4-bit training methods, our algorithm
can be implemented on the current generation of GPUs. Our prototype linear
operator implementation is up to 2.2 times faster than its FP16 counterpart
and speeds up training by up to 35.1%.
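
To make the forward-pass idea concrete, below is a minimal NumPy sketch of how an orthonormal Hadamard rotation can suppress activation outliers before symmetric INT4 quantization. This only illustrates the general principle, not the paper's actual quantizer or kernels; the helper names (hadamard, quantize_int4, dequantize), the per-tensor scaling, and the toy data are assumptions made for the example.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix of size n (n must be a power of two),
    built with Sylvester's construction."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int4(x: np.ndarray):
    """Symmetric per-tensor quantization to the INT4 range [-8, 7]."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)  # 4-bit values stored in int8
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy activation matrix with a few large outlier entries,
# mimicking the outlier structure of transformer activations.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64)).astype(np.float32)
X[0, :4] += 50.0

# Direct INT4 quantization: the outliers inflate the scale,
# so most (small) entries lose almost all precision.
q_direct, s_direct = quantize_int4(X)
err_direct = np.linalg.norm(dequantize(q_direct, s_direct) - X)

# Hadamard-then-quantize: the orthonormal rotation spreads the outlier
# energy over all coordinates, giving a much smaller quantization scale.
H = hadamard(X.shape[1])
q_had, s_had = quantize_int4(X @ H)
X_rec = dequantize(q_had, s_had) @ H.T  # undo the rotation (H is orthonormal)
err_had = np.linalg.norm(X_rec - X)

print(f"direct INT4 reconstruction error:   {err_direct:.2f}")
print(f"Hadamard INT4 reconstruction error: {err_had:.2f}")
```

The backward-pass techniques (bit splitting and leverage score sampling) are not shown here; the sketch only demonstrates the outlier-suppression effect of an orthonormal transform before low-bit quantization.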