CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Abstract

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to Nvidia's state-of-the-art closed-source libraries, cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields an average speedup of +22.0% over torch.matmul; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm based on the heuristic's suggestion; and +11.4% over the most competitive baseline, cuBLASLt-AutoTuning, which selects the fastest algorithm from up to 100 candidates suggested by cuBLASLt. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups increase further to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels such as HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2.
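The two evaluation modes and the autotuning-style baseline described in the abstract can be illustrated with a small timing harness. This is only a sketch under stated assumptions, not the project's benchmark code: the function names (`time_offline`, `time_server`, `speedup_percent`, `pick_fastest`), the gap distribution, the run counts, and the speedup formula are all hypothetical choices made for illustration. It times plain Python callables on the CPU; timing a real GPU kernel would additionally require synchronizing the device before reading the clock.

```python
import random
import time
from statistics import mean


def time_offline(kernel, runs=100):
    """Offline mode: call the kernel back-to-back, return mean latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        kernel()
        latencies.append(time.perf_counter() - start)
    return mean(latencies)


def time_server(kernel, runs=100, max_gap_s=0.001, seed=0):
    """Server mode: sleep a random idle gap before each call (gap is not timed).

    The uniform gap distribution and max_gap_s are assumptions for this sketch.
    """
    rng = random.Random(seed)
    latencies = []
    for _ in range(runs):
        time.sleep(rng.uniform(0.0, max_gap_s))
        start = time.perf_counter()
        kernel()
        latencies.append(time.perf_counter() - start)
    return mean(latencies)


def speedup_percent(baseline_s, candidate_s):
    """Assumed definition: how much faster the candidate is than the baseline, in percent."""
    return (baseline_s / candidate_s - 1.0) * 100.0


def pick_fastest(candidates, timer=time_offline):
    """Autotuning-style selection: time every candidate and keep the fastest one."""
    timings = {name: timer(fn) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings
```

With this harness, a baseline would be chosen as `best, timings = pick_fastest({"algo_a": fn_a, "algo_b": fn_b})` and compared against a candidate kernel with `speedup_percent(timings[best], time_offline(candidate_fn))`; swapping `time_offline` for `time_server` gives the corresponding server-mode comparison.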
