Fully-fused Multi-Layer Perceptrons on Intel Data Center GPUs

Abstract

This paper presents a SYCL implementation of Multi-Layer Perceptrons (MLPs) that targets and is optimized for the Intel Data Center GPU Max 1550. To increase performance, our implementation fuses the operations in each layer of the MLP, which minimizes slow global memory accesses by maximizing data reuse within the general register file and the shared local memory. We show with a simple roofline model that this results in a significant increase in arithmetic intensity, leading to improved performance, especially for inference. We compare our approach to a similar CUDA implementation for MLPs and show that our implementation on the Intel Data Center GPU outperforms the CUDA implementation on Nvidia's H100 GPU by a factor of up to 2.84 in inference and 1.75 in training. The paper also showcases the efficiency of our SYCL implementation in three significant areas: Image Compression, Neural Radiance Fields, and Physics-Informed Machine Learning. In all cases, our implementation outperforms the off-the-shelf Intel Extension for PyTorch (IPEX) implementation on the same Intel GPU by a factor of up to 30 and the CUDA PyTorch version on Nvidia's H100 GPU by a factor of up to 19. The code can be found at https://github.com/intel/tiny-dpcpp-nn.
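To illustrate the roofline argument made above, the following is a minimal sketch and not the paper's actual model. It assumes a square MLP with identical layer widths, counts only global memory traffic for weights and activations, and assumes that a fused implementation reads the input and writes the output once while an unfused one does so at every layer. The function names, the traffic model, and all numeric values in the usage note are hypothetical placeholders, not measured hardware figures.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, arithmetic_intensity):
    """Roofline bound: performance is capped by compute or by memory traffic."""
    return min(peak_flops, bandwidth_bytes_per_s * arithmetic_intensity)


def mlp_arithmetic_intensity(batch, width, n_layers, bytes_per_elem=2, fused=True):
    """FLOPs per byte of global memory traffic for a square MLP (simplified).

    Assumption: each layer is a (batch x width) by (width x width) matmul,
    costing 2 * batch * width * width FLOPs; biases and activations are ignored.
    """
    flops = 2 * batch * width * width * n_layers
    weight_bytes = n_layers * width * width * bytes_per_elem
    activation_bytes = batch * width * bytes_per_elem
    if fused:
        # Intermediate activations stay on chip: read input once, write output once.
        traffic = weight_bytes + 2 * activation_bytes
    else:
        # Every layer reads its input from and writes its output to global memory.
        traffic = weight_bytes + 2 * activation_bytes * n_layers
    return flops / traffic
```

With placeholder values such as `batch=2**17`, `width=64`, and `n_layers=4`, comparing `mlp_arithmetic_intensity(..., fused=True)` against `fused=False` shows the fused variant reaching a higher arithmetic intensity, and passing both into `attainable_flops` with a device's peak throughput and bandwidth gives the corresponding roofline bounds under these assumptions.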
