An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs
June 28, 2023
Authors: Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, Moshe Wasserblat
cs.AI
Abstract
In recent years, Transformer-based language models have become the standard
approach for natural language processing tasks. However, stringent throughput
and latency requirements in industrial applications are limiting their
adoption. To mitigate the gap, model compression techniques such as structured
pruning are being used to improve inference efficiency. However, most existing
neural network inference runtimes lack adequate support for structured
sparsity. In this paper, we propose an efficient sparse deep learning inference
software stack for Transformer-based language models where the weights are
pruned with constant block size. Our sparse software accelerator leverages
Intel Deep Learning Boost to maximize the performance of sparse matrix-dense
matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel
outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an
order of magnitude on a wide range of GEMM shapes under 5 representative
sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up
to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library
widely used in industry. We apply our sparse accelerator on widely-used
Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base,
and BERT-Large. Our sparse inference software shows up to 1.5x speedup over
Neural Magic's Deepsparse under the same configurations on Xeon instances on
Amazon Web Services, under proxy production latency constraints. We also compare our
solution with two framework-based inference solutions, ONNX Runtime and
PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over
PyTorch on Xeon under the latency constraints. All the source code is publicly
available on Github: https://github.com/intel/intel-extension-for-transformers.
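Below is a minimal NumPy sketch of the two ideas summarized in the abstract: pruning a weight matrix with a constant block size and multiplying the resulting sparse weights against dense activations (SpMM). The 4x1 block shape, the L1-norm pruning criterion, and the dense masked matmul are illustrative assumptions chosen for exposition only; the kernel described in the paper stores only the non-zero blocks and dispatches to hand-tuned code paths built on Intel Deep Learning Boost instructions.

# Illustrative sketch only: constant-block-size pruning plus a masked matmul
# standing in for SpMM. Block shape, pruning criterion, and sizes are
# assumptions for this example, not the paper's exact kernel.
import numpy as np

def prune_constant_blocks(weight, block_shape=(4, 1), sparsity=0.9):
    """Zero out whole blocks of `weight` (lowest L1 norm first) until the
    requested sparsity ratio is reached. Assumes dimensions divide evenly."""
    bh, bw = block_shape
    rows, cols = weight.shape
    blocks = weight.reshape(rows // bh, bh, cols // bw, bw)
    # L1 norm per block, used here as a simple importance score.
    scores = np.abs(blocks).sum(axis=(1, 3))
    n_blocks = scores.size
    n_prune = int(n_blocks * sparsity)
    # Flat indices of the least important blocks.
    prune_idx = np.argsort(scores, axis=None)[:n_prune]
    mask = np.ones(n_blocks, dtype=bool)
    mask[prune_idx] = False
    mask = mask.reshape(scores.shape)[:, None, :, None]
    return (blocks * mask).reshape(rows, cols)

rng = np.random.default_rng(0)
weight = rng.standard_normal((768, 768)).astype(np.float32)       # e.g. one BERT projection weight
activations = rng.standard_normal((768, 128)).astype(np.float32)  # dense input matrix

sparse_weight = prune_constant_blocks(weight, block_shape=(4, 1), sparsity=0.9)
print("sparsity ratio:", 1.0 - np.count_nonzero(sparse_weight) / sparse_weight.size)

# SpMM here is just a dense matmul on the masked weights; a real sparse kernel
# would store only the surviving blocks and skip the zeroed ones entirely.
output = sparse_weight @ activations

Skipping the zeroed blocks rather than multiplying through them is where a sparse kernel's advantage over a dense GEMM comes from, which is the effect the reported speedups over oneDNN and the other baselines measure.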