CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Abstract

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used {\it torch.matmul} to Nvidia's state-of-the-art closed-source libraries, {\it cuBLAS} and {\it cuBLASLt}. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields an average speedup of +22.0\% over {\it torch.matmul}; +19.2\% over {\it cuBLAS} using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8\% over {\it cuBLASLt-heuristic}, which queries the {\it cuBLASLt} library and selects the algorithm based on the heuristic's suggestion; and +11.4\% over the most competitive baseline, {\it cuBLASLt-AutoTuning}, which selects the fastest algorithm from up to 100 candidates suggested by {\it cuBLASLt}. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups increase further to +28.7\%, +26.0\%, +22.4\%, and +15.9\% over {\it torch.matmul}, {\it cuBLAS}, {\it cuBLASLt-heuristic}, and {\it cuBLASLt-AutoTuning}, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation that systematically explores configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2.
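
To make the two evaluation modes concrete, the following Python sketch shows one way the measurements could be structured. It is an illustration under stated assumptions, not the harness used by CUDA-L2: the names time_kernel, pick_fastest, and speedup_percent are hypothetical, the speedup formula (baseline time divided by candidate time, minus one) is an assumed reading of the reported percentages, and host wall-clock timing stands in for proper GPU timing, which would additionally require device synchronization.

```python
import random
import time
from statistics import mean


def time_kernel(fn, runs, max_gap_s=0.0, seed=0):
    """Mean wall-clock seconds per call of fn over `runs` calls.

    max_gap_s == 0.0 mimics offline mode: calls run back to back.
    max_gap_s > 0.0 mimics server mode: a random idle gap precedes each call.
    (Hypothetical helper; a real GPU benchmark must also synchronize the device.)
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(runs):
        if max_gap_s > 0.0:
            time.sleep(rng.uniform(0.0, max_gap_s))  # idle gap is not timed
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def pick_fastest(candidates, runs):
    """Time every candidate and return (name, mean_seconds) of the fastest.

    This mirrors the idea of an auto-tuning baseline that measures a list of
    candidate algorithms and keeps the best one. `candidates` maps a name to
    a zero-argument callable.
    """
    timings = {name: time_kernel(fn, runs) for name, fn in candidates.items()}
    best = min(timings, key=timings.get)
    return best, timings[best]


def speedup_percent(baseline_s, candidate_s):
    """Assumed definition: how much faster the candidate is, in percent."""
    return (baseline_s / candidate_s - 1.0) * 100.0


if __name__ == "__main__":
    # Stand-in workloads; real use would wrap GEMM kernel launches instead.
    baseline = lambda: sum(i * i for i in range(20000))
    candidate = lambda: sum(i * i for i in range(15000))

    for label, gap in (("offline", 0.0), ("server", 0.002)):
        t_base = time_kernel(baseline, runs=20, max_gap_s=gap)
        t_cand = time_kernel(candidate, runs=20, max_gap_s=gap)
        print(f"{label}: {speedup_percent(t_base, t_cand):+.1f}%")
```

Under the assumed formula, a baseline that takes 1.22 time units against a candidate that takes 1.00 gives +22.0%. The random gap is excluded from the timed region, so any difference between the two modes comes from the state the kernel finds the hardware in after idling, not from the sleep itself.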