Binary and Ternary Natural Language Generation
June 2, 2023
Authors: Zechun Liu, Barlas Oguz, Aasish Pappu, Yangyang Shi, Raghuraman Krishnamoorthi
cs.AI
Abstract
Ternary and binary neural networks enable multiplication-free computation and
promise multiple orders of magnitude efficiency gains over full-precision
networks if implemented on specialized hardware. However, since both the
parameter and the output space are highly discretized, such networks have
proven very difficult to optimize. The difficulties are compounded for the
class of transformer text generation models due to the sensitivity of the
attention operation to quantization and the noise-compounding effects of
autoregressive decoding in the high-cardinality output space. We approach the
problem with a mix of statistics-based quantization for the weights and elastic
quantization of the activations and demonstrate the first ternary and binary
transformer models on the downstream tasks of summarization and machine
translation. Our ternary BART base achieves an R1 score of 41 on the
CNN/DailyMail benchmark, which is merely 3.9 points behind the full model while
being 16x more efficient. Our binary model, while less accurate, achieves a
highly non-trivial score of 35.6. For machine translation, we achieve BLEU
scores of 21.7 and 17.6 on the WMT16 En-Ro benchmark, compared with a full
precision mBART model score of 26.8. We also compare our approach in the 8-bit
activation setting, where our ternary and even binary weight models can match
or outperform the best existing 8-bit weight models in the literature. Our code
and models are available at:
https://github.com/facebookresearch/Ternary_Binary_Transformer
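To make the weight-quantization idea concrete, the sketch below illustrates one common family of statistics-based ternarization and binarization schemes (in the style of Ternary Weight Networks and XNOR-Net scaling). The threshold factor of 0.7 and the per-tensor scaling are standard heuristics from that literature, not necessarily the exact scheme used in this paper; see the linked repository for the authors' implementation.

```python
import numpy as np

def ternarize(w, delta_factor=0.7):
    """Statistics-based ternarization: weights below a threshold
    become 0; the rest become +/- alpha, where alpha is the mean
    magnitude of the surviving weights. The 0.7 threshold factor
    is a common heuristic from Ternary Weight Networks."""
    delta = delta_factor * np.abs(w).mean()   # per-tensor threshold
    mask = np.abs(w) > delta                  # weights that stay nonzero
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return alpha * np.sign(w) * mask          # values in {-alpha, 0, +alpha}

def binarize(w):
    """1-bit quantization: sign(w) scaled by the mean magnitude,
    as in BinaryConnect / XNOR-Net style schemes."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w)                 # values in {-alpha, +alpha}
```

Because every weight takes one of at most three (or two) values, matrix products reduce to additions, subtractions, and a single scale per tensor, which is the source of the multiplication-free efficiency gains the abstract describes.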