CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Abstract

In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used {\it torch.matmul} to Nvidia's state-of-the-art closed-source libraries, {\it cuBLAS} and {\it cuBLASLt}. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields an average improvement of +22.0\% over {\it torch.matmul}; +19.2\% over {\it cuBLAS} using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8\% over {\it cuBLASLt-heuristic}, which queries the {\it cuBLASLt} library and selects the algorithm based on the heuristic's suggestion; and +11.4\% over the most competitive baseline, {\it cuBLASLt-AutoTuning}, which selects the fastest algorithm from up to 100 candidates suggested by {\it cuBLASLt}. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups increase further to +28.7\%, +26.0\%, +22.4\%, and +15.9\% over {\it torch.matmul}, {\it cuBLAS}, {\it cuBLASLt-heuristic}, and {\it cuBLASLt-AutoTuning}, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels, such as HGEMM, can be improved through LLM-guided RL automation that systematically explores configuration spaces at scales impractical for humans. The project and code can be found at github.com/deepreinforce-ai/CUDA-L2.
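The two evaluation modes and the reported percentages can be illustrated with a small host-side sketch. It is not the authors' benchmark harness: the function names (`time_kernel`, `speedup_percent`, `pick_fastest`), the uniform distribution of idle gaps, and the definition of speedup as a ratio of mean times are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
import time
from statistics import mean
from typing import Callable, Dict, Optional


def time_kernel(kernel: Callable[[], None], runs: int = 20,
                max_gap_s: float = 0.0,
                rng: Optional[random.Random] = None) -> float:
    """Return the mean wall-clock time (seconds) of one call to `kernel`.

    max_gap_s == 0 imitates offline mode: calls run back to back.
    max_gap_s > 0 imitates server mode: a random idle gap (assumed
    uniform here) precedes each call and is excluded from the timing.
    """
    rng = rng or random.Random(0)
    samples = []
    for _ in range(runs):
        if max_gap_s > 0.0:
            time.sleep(rng.uniform(0.0, max_gap_s))  # idle gap, not timed
        start = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - start)
    return mean(samples)


def speedup_percent(baseline_s: float, candidate_s: float) -> float:
    """Speedup of the candidate over the baseline, in percent (assumed definition)."""
    return (baseline_s / candidate_s - 1.0) * 100.0


def pick_fastest(timings: Dict[str, float]) -> str:
    """Auto-tuning style selection: name of the candidate with the lowest time."""
    return min(timings, key=timings.get)
```

Under this assumed definition, a baseline averaging 1.22 ms against a candidate averaging 1.00 ms corresponds to a speedup of about +22\%. On a real GPU the timing would need device-side measurement, for example CUDA events with explicit synchronization, rather than host wall-clock time.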
