
Training Transformers with 4-bit Integers

June 21, 2023
Authors: Haocheng Xi, Changhao Li, Jianfei Chen, Jun Zhu
cs.AI

Abstract

Quantizing the activation, weight, and gradient to 4-bit is promising to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented with the INT4 arithmetic. Training with an ultra-low INT4 precision is challenging. To achieve this, we carefully analyze the specific structures of activation and gradients in transformers to propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress the outliers. For backpropagation, we leverage the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. Our prototypical linear operator implementation is up to 2.2 times faster than the FP16 counterparts and speeds up the training by up to 35.1%.
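To illustrate the Hadamard quantizer idea mentioned in the abstract, the sketch below rotates activations with a normalized Hadamard matrix so that the magnitude of individual outliers is spread across all coordinates before symmetric 4-bit integer quantization. This is a minimal NumPy sketch under simplifying assumptions (power-of-two feature dimension, per-tensor scaling), not the paper's implementation; all function and variable names here are hypothetical.

```python
# Illustrative sketch of a Hadamard-then-INT4 quantization step.
# Not the paper's code: names, scaling scheme, and granularity are assumptions.
import numpy as np

def hadamard_matrix(n: int) -> np.ndarray:
    """Build an n x n normalized Hadamard matrix (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthogonal: H @ H.T == I

def quantize_int4_symmetric(x: np.ndarray):
    """Symmetric per-tensor quantization to the INT4 range [-8, 7]."""
    scale = np.max(np.abs(x)) / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)  # int8 container for 4-bit values
    return q, scale

def hadamard_quantize(x: np.ndarray):
    """Rotate the last dimension with a Hadamard matrix, then quantize to INT4."""
    n = x.shape[-1]
    H = hadamard_matrix(n)
    x_rot = x @ H  # an outlier's energy is now spread over all coordinates
    q, scale = quantize_int4_symmetric(x_rot)
    return q, scale, H

def dequantize(q: np.ndarray, scale: float, H: np.ndarray) -> np.ndarray:
    """Undo the quantization scale and the (orthogonal) Hadamard rotation."""
    return (q.astype(np.float32) * scale) @ H.T

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 64)).astype(np.float32)
    x[0, 3] = 25.0  # a synthetic activation outlier
    q, scale, H = hadamard_quantize(x)
    x_hat = dequantize(q, scale, H)
    print("max abs reconstruction error:", np.max(np.abs(x - x_hat)))
```

Because the Hadamard matrix is orthogonal, the rotation itself is lossless; the only approximation in this toy example comes from rounding to the 4-bit grid, which the rotation makes less sensitive to a single large entry.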