Binary and Ternary Natural Language Generation

Abstract

Ternary and binary neural networks enable multiplication-free computation and promise efficiency gains of multiple orders of magnitude over full-precision networks when implemented on specialized hardware. However, because both the parameter space and the output space are highly discretized, such networks have proven very difficult to optimize. The difficulties are compounded for transformer text generation models by the sensitivity of the attention operation to quantization and by the noise-compounding effects of autoregressive decoding in a high-cardinality output space. We approach the problem with a mix of statistics-based quantization for the weights and elastic quantization for the activations, and demonstrate the first ternary and binary transformer models on the downstream tasks of summarization and machine translation. Our ternary BART base achieves an R1 score of 41 on the CNN/DailyMail benchmark, merely 3.9 points behind the full-precision model while being 16x more efficient. Our binary model, while less accurate, achieves a highly non-trivial score of 35.6. For machine translation, our ternary and binary models achieve BLEU scores of 21.7 and 17.6, respectively, on the WMT16 En-Ro benchmark, compared with 26.8 for a full-precision mBART model. We also evaluate our approach in the 8-bit activation setting, where our ternary and even binary weight models can match or outperform the best existing 8-bit weight models in the literature. Our code and models are available at: https://github.com/facebookresearch/Ternary_Binary_Transformer
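
To make "statistics-based quantization for the weights" concrete, the following is a minimal NumPy sketch of how a weight tensor might be mapped to binary or ternary values using only its own statistics. The abstract does not specify the rules, so the details here are assumptions borrowed from common practice: the scale `alpha` is taken as a mean absolute weight, and the ternary threshold is `delta_ratio` times the mean absolute weight, with `delta_ratio = 0.7` as an illustrative default. The function names `binarize` and `ternarize` are hypothetical and are not taken from the released code. Elastic activation quantization is not covered.

```python
import numpy as np


def binarize(w: np.ndarray) -> np.ndarray:
    """Map weights to {-alpha, +alpha}.

    Assumption: alpha is the mean absolute weight, a common choice
    for statistics-based binarization.
    """
    alpha = np.mean(np.abs(w))
    return alpha * np.where(w >= 0, 1.0, -1.0)


def ternarize(w: np.ndarray, delta_ratio: float = 0.7) -> np.ndarray:
    """Map weights to {-alpha, 0, +alpha}.

    Assumption: weights with magnitude at or below
    delta_ratio * mean(|w|) are zeroed, and alpha is the mean
    magnitude of the surviving weights.
    """
    delta = delta_ratio * np.mean(np.abs(w))
    mask = np.abs(w) > delta
    alpha = np.abs(w[mask]).mean() if mask.any() else 0.0
    return np.where(mask, alpha * np.sign(w), 0.0)


# Illustrative usage on a random weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
w_b = binarize(w)
w_t = ternarize(w)
```

Because every quantized weight is either zero or a shared scale with a sign, a matrix product against these weights reduces to additions and subtractions followed by one multiplication by `alpha`, which is where the multiplication-free computation mentioned above comes from.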
