Training Transformers with 4-bit Integers
June 21, 2023
Authors: Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu
cs.AI
Abstract
Quantizing activations, weights, and gradients to 4 bits is a promising way to
accelerate neural network training. However, existing 4-bit training methods
require custom numerical formats which are not supported by contemporary
hardware. In this work, we propose a training method for transformers with all
matrix multiplications implemented with INT4 arithmetic. Training with such
ultra-low INT4 precision is challenging. To make this possible, we carefully analyze
the specific structures of activations and gradients in transformers and propose
dedicated quantizers for them. For forward propagation, we identify the
challenge of outliers and propose a Hadamard quantizer to suppress the
outliers. For backpropagation, we exploit the structural sparsity of gradients,
proposing bit splitting and leverage score sampling techniques to quantize
gradients accurately. Our algorithm achieves competitive accuracy on a wide
range of tasks including natural language understanding, machine translation,
and image classification. Unlike previous 4-bit training methods, our algorithm
can be implemented on the current generation of GPUs. Our prototype linear
operator implementation is up to 2.2 times faster than its FP16 counterpart
and speeds up training by up to 35.1%.
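
To make the forward-pass idea concrete, below is a minimal NumPy sketch of how an orthonormal Hadamard rotation can suppress activation outliers before symmetric INT4 quantization. This only illustrates the general principle, not the paper's actual quantizer or kernels; the helper names (hadamard, quantize_int4, dequantize), the per-tensor scaling, and the toy data are assumptions made for the example.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix of size n (n must be a power of two),
    built with Sylvester's construction."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize_int4(x: np.ndarray):
    """Symmetric per-tensor quantization to the INT4 range [-8, 7]."""
    scale = np.abs(x).max() / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)  # 4-bit values stored in int8
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy activation matrix with a few large outlier entries,
# mimicking the outlier structure of transformer activations.
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 64)).astype(np.float32)
X[0, :4] += 50.0

# Direct INT4 quantization: the outliers inflate the scale,
# so most (small) entries lose almost all precision.
q_direct, s_direct = quantize_int4(X)
err_direct = np.linalg.norm(dequantize(q_direct, s_direct) - X)

# Hadamard-then-quantize: the orthonormal rotation spreads the outlier
# energy over all coordinates, giving a much smaller quantization scale.
H = hadamard(X.shape[1])
q_had, s_had = quantize_int4(X @ H)
X_rec = dequantize(q_had, s_had) @ H.T  # undo the rotation (H is orthonormal)
err_had = np.linalg.norm(X_rec - X)

print(f"direct INT4 reconstruction error:   {err_direct:.2f}")
print(f"Hadamard INT4 reconstruction error: {err_had:.2f}")
```

The backward-pass techniques (bit splitting and leverage score sampling) are not shown here; the sketch only demonstrates the outlier-suppression effect of an orthonormal transform before low-bit quantization.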