An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs
June 28, 2023
Authors: Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, Moshe Wasserblat
cs.AI
Abstract
In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To bridge this gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models in which the weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under five representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel delivers up to a 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large. Under proxy production latency constraints, our sparse inference software delivers up to a 1.5x speedup over Neural Magic's DeepSparse under the same configuration on Xeon instances on Amazon Web Services. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to a 37x speedup over ONNX Runtime and a 345x speedup over PyTorch on Xeon under the same latency constraints. All source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.
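
To make the block-sparse SpMM computation concrete, below is a minimal NumPy sketch of a sparse-weight times dense-activation multiply with a constant block size. The 4x1 block shape, the 90% sparsity level, and the helper names (`to_block_sparse`, `block_spmm`) are illustrative assumptions; the paper's actual kernel is a hand-tuned CPU implementation built on Intel Deep Learning Boost instructions, not a NumPy routine.

```python
# Illustrative sketch of block-sparse SpMM with a constant block size.
# Assumptions (not taken from the paper): 4x1 blocks, 90% block sparsity,
# NumPy as the reference backend.
import numpy as np

BLOCK_ROWS, BLOCK_COLS = 4, 1  # assumed constant block shape


def to_block_sparse(weight, block_rows=BLOCK_ROWS, block_cols=BLOCK_COLS):
    """Compress a 2-D weight matrix into (block values, block indices),
    keeping only the blocks that contain at least one non-zero element."""
    rows, cols = weight.shape
    assert rows % block_rows == 0 and cols % block_cols == 0
    values, indices = [], []
    for bi in range(rows // block_rows):
        for bj in range(cols // block_cols):
            block = weight[bi * block_rows:(bi + 1) * block_rows,
                           bj * block_cols:(bj + 1) * block_cols]
            if np.any(block != 0):
                values.append(block)
                indices.append((bi, bj))
    return values, indices


def block_spmm(values, indices, dense, out_rows):
    """Multiply the compressed sparse weight by a dense activation matrix,
    accumulating only the retained blocks."""
    out = np.zeros((out_rows, dense.shape[1]), dtype=dense.dtype)
    for block, (bi, bj) in zip(values, indices):
        out[bi * BLOCK_ROWS:(bi + 1) * BLOCK_ROWS] += (
            block @ dense[bj * BLOCK_COLS:(bj + 1) * BLOCK_COLS]
        )
    return out


# Usage: prune a random weight to ~90% block sparsity, then check the
# block-sparse product against the dense GEMM result.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
mask = rng.random((64 // BLOCK_ROWS, 64 // BLOCK_COLS)) < 0.9  # blocks to zero out
W[np.repeat(np.repeat(mask, BLOCK_ROWS, 0), BLOCK_COLS, 1)] = 0.0
X = rng.standard_normal((64, 128)).astype(np.float32)

vals, idx = to_block_sparse(W)
np.testing.assert_allclose(block_spmm(vals, idx, X, 64), W @ X,
                           rtol=1e-4, atol=1e-4)
```

The point of the constant block shape is that each retained block maps to a fixed-size multiply-accumulate, which is what lets a production kernel replace the inner `block @ dense` step with vectorized VNNI instructions instead of the generic NumPy matmul used in this sketch.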