CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Abstract

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to Nvidia's state-of-the-art closed-source libraries, cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields an average speedup of +22.0% over torch.matmul; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive baseline, cuBLASLt-AutoTuning, which selects the fastest algorithm from up to 100 candidates among cuBLASLt's suggestions. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups increase further to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. The project and code can be found at github.com/deepreinforce-ai/CUDA-L2.
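The difference between offline mode and server mode can be illustrated with a small timing harness. The sketch below is not the authors' benchmark code: the helper names `time_kernel` and `speedup_percent`, the uniform distribution of idle gaps, and the speedup formula (baseline time divided by candidate time, minus one) are all assumptions made for illustration. It also times plain Python callables with a wall clock; measuring real GPU kernels would additionally require device synchronization around each call.

```python
import random
import time
from statistics import mean
from typing import Callable, Optional


def time_kernel(kernel: Callable[[], None], runs: int,
                max_gap_s: float = 0.0,
                rng: Optional[random.Random] = None) -> float:
    """Return the mean wall-clock seconds per call of `kernel`.

    max_gap_s == 0 mimics offline mode (calls run back to back).
    max_gap_s > 0 mimics server mode (a random idle gap precedes each call).
    Both the gap distribution and the names here are hypothetical.
    """
    rng = rng or random.Random(0)
    samples = []
    for _ in range(runs):
        if max_gap_s > 0:
            # The idle gap itself is excluded from the measurement.
            time.sleep(rng.uniform(0.0, max_gap_s))
        start = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Relative speedup of the candidate over the baseline, in percent.

    Assumed definition: (baseline time / candidate time - 1) * 100.
    """
    return (baseline_s / candidate_s - 1.0) * 100.0


if __name__ == "__main__":
    # Stand-in workloads; a real comparison would launch HGEMM kernels instead.
    baseline = lambda: sum(range(20_000))
    candidate = lambda: sum(range(15_000))

    offline = speedup_percent(time_kernel(baseline, 50),
                              time_kernel(candidate, 50))
    server = speedup_percent(time_kernel(baseline, 50, max_gap_s=0.002),
                             time_kernel(candidate, 50, max_gap_s=0.002))
    print(f"offline: {offline:+.1f}%  server: {server:+.1f}%")
```

Under this assumed definition, a candidate that takes 1.00 ms where the baseline takes 1.22 ms would be reported as a +22.0% speedup.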
