An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs
June 28, 2023
Authors: Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, Moshe Wasserblat
cs.AI
Abstract
In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To bridge this gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models in which the weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under five representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel delivers up to a 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large. Under proxy production latency constraints, our sparse inference software delivers up to a 1.5x speedup over Neural Magic's DeepSparse under the same configuration on Xeon instances on Amazon Web Services. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to a 37x speedup over ONNX Runtime and a 345x speedup over PyTorch on Xeon under the same latency constraints. All source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.
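
To make the block-sparse SpMM computation concrete, below is a minimal NumPy sketch of a sparse-weight times dense-activation multiply with a constant block size. The 4x1 block shape, the 90% sparsity level, and the helper names (`to_block_sparse`, `block_spmm`) are illustrative assumptions; the paper's actual kernel is a hand-tuned CPU implementation built on Intel Deep Learning Boost instructions, not a NumPy routine.

```python
# Illustrative sketch of block-sparse SpMM with a constant block size.
# Assumptions (not taken from the paper): 4x1 blocks, 90% block sparsity,
# NumPy as the reference backend.
import numpy as np

BLOCK_ROWS, BLOCK_COLS = 4, 1  # assumed constant block shape


def to_block_sparse(weight, block_rows=BLOCK_ROWS, block_cols=BLOCK_COLS):
    """Compress a 2-D weight matrix into (block values, block indices),
    keeping only the blocks that contain at least one non-zero element."""
    rows, cols = weight.shape
    assert rows % block_rows == 0 and cols % block_cols == 0
    values, indices = [], []
    for bi in range(rows // block_rows):
        for bj in range(cols // block_cols):
            block = weight[bi * block_rows:(bi + 1) * block_rows,
                           bj * block_cols:(bj + 1) * block_cols]
            if np.any(block != 0):
                values.append(block)
                indices.append((bi, bj))
    return values, indices


def block_spmm(values, indices, dense, out_rows):
    """Multiply the compressed sparse weight by a dense activation matrix,
    accumulating only the retained blocks."""
    out = np.zeros((out_rows, dense.shape[1]), dtype=dense.dtype)
    for block, (bi, bj) in zip(values, indices):
        out[bi * BLOCK_ROWS:(bi + 1) * BLOCK_ROWS] += (
            block @ dense[bj * BLOCK_COLS:(bj + 1) * BLOCK_COLS]
        )
    return out


# Usage: prune a random weight to ~90% block sparsity, then check the
# block-sparse product against the dense GEMM result.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
mask = rng.random((64 // BLOCK_ROWS, 64 // BLOCK_COLS)) < 0.9  # blocks to zero out
W[np.repeat(np.repeat(mask, BLOCK_ROWS, 0), BLOCK_COLS, 1)] = 0.0
X = rng.standard_normal((64, 128)).astype(np.float32)

vals, idx = to_block_sparse(W)
np.testing.assert_allclose(block_spmm(vals, idx, X, 64), W @ X,
                           rtol=1e-4, atol=1e-4)
```

The point of the constant block shape is that each retained block maps to a fixed-size multiply-accumulate, which is what lets a production kernel replace the inner `block @ dense` step with vectorized VNNI instructions instead of the generic NumPy matmul used in this sketch.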