Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs
March 26, 2024
Authors: Kai Yuan, Christoph Bauinger, Xiangyi Zhang, Pascal Baehr, Matthias Kirchhart, Darius Dabert, Adrien Tousnakhoff, Pierre Boudier, Michael Paulitsch
cs.AI
Abstract
This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs),
which targets and is optimized for the Intel Data Center GPU Max 1550. To
increase performance, our implementation fuses the operations in each layer of
the MLP, which minimizes slow global memory accesses by maximizing data reuse
within the general register file and the shared local memory. We show
with a simple roofline model that this results in a significant increase in the
arithmetic intensity, leading to improved performance, especially for
inference. We compare our approach to a similar CUDA implementation for MLPs
and show that our implementation on the Intel Data Center GPU outperforms the
CUDA implementation on Nvidia's H100 GPU by a factor of up to 2.84 in inference
and up to 1.75 in training. The paper also showcases the efficiency of our SYCL
implementation in three significant areas: Image Compression, Neural Radiance
Fields, and Physics-Informed Machine Learning. In all cases, our implementation
outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation
on the same Intel GPU by up to a factor of 30, and the CUDA PyTorch version on
Nvidia's H100 GPU by up to a factor of 19. The code can be found at
https://github.com/intel/tiny-dpcpp-nn.
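The roofline argument in the abstract can be made concrete with a back-of-the-envelope estimate: fusing layers keeps intermediate activations in registers and shared local memory, so only the network input, the weights, and the final output touch global memory, which raises FLOPs per byte. The sketch below is illustrative only; the shapes, byte counts, and the `mlp_intensity` helper are assumptions for this example, not the paper's exact model.

```python
def mlp_intensity(batch, width, layers, bytes_per_elem=2, fused=True):
    """Rough FLOPs-per-byte estimate for one MLP inference pass.

    Assumes square layers of size `width` x `width` and half-precision
    (2-byte) elements by default; these are illustrative choices, not
    the paper's measured configuration.
    """
    flops = 2 * batch * width * width * layers            # one GEMM per layer
    weight_bytes = layers * width * width * bytes_per_elem
    if fused:
        # Only the network input and final output cross global memory;
        # intermediate activations stay in registers / shared local memory.
        act_bytes = 2 * batch * width * bytes_per_elem
    else:
        # Every layer reads its input from and writes its output to
        # global memory.
        act_bytes = 2 * layers * batch * width * bytes_per_elem
    return flops / (weight_bytes + act_bytes)

# Example: batch 2^17, width 64, 4 hidden layers.
unfused = mlp_intensity(2**17, 64, 4, fused=False)
fused = mlp_intensity(2**17, 64, 4, fused=True)
print(f"unfused: {unfused:.1f} FLOPs/byte, fused: {fused:.1f} FLOPs/byte")
```

Under these assumptions the fused variant's arithmetic intensity is several times higher than the unfused one, which is the mechanism behind the inference speedups the abstract reports.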