
Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs

March 26, 2024
Authors: Kai Yuan, Christoph Bauinger, Xiangyi Zhang, Pascal Baehr, Matthias Kirchhart, Darius Dabert, Adrien Tousnakhoff, Pierre Boudier, Michael Paulitsch
cs.AI

Abstract

This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs), targeted at and optimized for the Intel Data Center GPU Max 1550. To increase performance, our implementation minimizes slow global memory accesses by fusing the operations in each layer of the MLP, thereby maximizing data reuse within the general register file and the shared local memory. We show with a simple roofline model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by up to a factor of 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by up to a factor of 30 and the CUDA PyTorch version on Nvidia's H100 GPU by up to a factor of 19. The code can be found at https://github.com/intel/tiny-dpcpp-nn.
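As a rough illustration of the roofline argument behind fusion, the C++ sketch below estimates the arithmetic intensity of fused versus unfused MLP inference. The layer width, depth, batch size, and element size are assumed illustrative values, not the paper's benchmark configuration; the point is only that keeping intermediate activations on-chip removes per-layer activation traffic from global memory and raises FLOPs per byte.

```cpp
// Back-of-the-envelope roofline estimate comparing fused vs. unfused MLP
// inference. All parameters below are illustrative assumptions.
#include <cstdio>

int main() {
    const double width = 64;       // assumed hidden-layer width K
    const double depth = 4;        // assumed number of layers
    const double batch = 1 << 20;  // assumed number of input elements
    const double bytes = 2;        // half-precision element size in bytes

    // 2*K*K FLOPs per element per layer (multiply-add counted as 2 FLOPs).
    const double flops = 2.0 * width * width * batch * depth;

    // Unfused: every layer streams its input activations, output activations,
    // and weight matrix through global memory.
    const double unfused_bytes =
        depth * (2.0 * batch * width + width * width) * bytes;

    // Fused: activations stay in registers / shared local memory, so global
    // traffic is only the network input, the final output, and the weights.
    const double fused_bytes =
        (2.0 * batch * width + depth * width * width) * bytes;

    std::printf("arithmetic intensity (unfused): %.1f FLOP/byte\n",
                flops / unfused_bytes);
    std::printf("arithmetic intensity (fused):   %.1f FLOP/byte\n",
                flops / fused_bytes);
    return 0;
}
```

Under these assumed parameters the fused variant reaches roughly `depth` times the arithmetic intensity of the unfused one, which is why the gains are largest for inference, where the backward pass does not force intermediate activations back to global memory.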

