CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
December 2, 2025
Authors: Songqiao Su, Xiaofei Sun, Xiaoya Li, Albert Wang, Jiwei Li, Chris Shum
cs.AI
Abstract
In this paper, we propose CUDA-L2, a system that combines large language models (LLMs) and reinforcement learning (RL) to automatically optimize Half-precision General Matrix Multiply (HGEMM) CUDA kernels. Using CUDA execution speed as the RL reward, CUDA-L2 automatically optimizes HGEMM kernels across 1,000 configurations. CUDA-L2 systematically outperforms the major matmul baselines to date, from the widely used torch.matmul to NVIDIA's state-of-the-art closed-source libraries, cuBLAS and cuBLASLt. In offline mode, where kernels are executed consecutively without time intervals, CUDA-L2 yields an average speedup of +22.0% over torch.matmul; +19.2% over cuBLAS using the optimal layout configuration (normal-normal NN and transposed-normal TN); +16.8% over cuBLASLt-heuristic, which queries the cuBLASLt library and selects the algorithm its heuristic suggests; and +11.4% over the most competitive baseline, cuBLASLt-AutoTuning, which selects the fastest algorithm from up to 100 candidates suggested by cuBLASLt. In server mode, where kernels are executed at random intervals to simulate real-time inference, the speedups increase further, to +28.7%, +26.0%, +22.4%, and +15.9% over torch.matmul, cuBLAS, cuBLASLt-heuristic, and cuBLASLt-AutoTuning, respectively. CUDA-L2 shows that even the most performance-critical, heavily optimized kernels like HGEMM can be improved through LLM-guided RL automation by systematically exploring configuration spaces at scales impractical for humans. Project and code can be found at github.com/deepreinforce-ai/CUDA-L2.
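The offline/server distinction is a timing-methodology point: offline mode amortizes launch overhead by running kernels back-to-back, while server mode measures each launch in isolation with random idle gaps in between. As a rough illustration only (not the paper's actual benchmark harness; the shapes, gap distribution, and helper names below are assumptions), a minimal PyTorch sketch of the two modes might look like this:

```python
import random
import time

import torch

def bench_offline(fn, iters=100, warmup=10):
    """Offline mode: launch kernels consecutively with no gaps,
    timing the whole batch with CUDA events and averaging."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

def bench_server(fn, iters=100, warmup=10, max_gap_ms=5.0):
    """Server mode: sleep a random interval between launches to
    mimic real-time inference traffic, timing each call alone."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    total_ms = 0.0
    for _ in range(iters):
        time.sleep(random.uniform(0.0, max_gap_ms) / 1e3)
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()
        total_ms += start.elapsed_time(end)
    return total_ms / iters  # ms per call

# One illustrative HGEMM shape (half precision, as in the paper's setting).
a = torch.randn(4096, 4096, device="cuda", dtype=torch.half)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.half)
print("offline:", bench_offline(lambda: torch.matmul(a, b)), "ms")
print("server :", bench_server(lambda: torch.matmul(a, b)), "ms")
```

Under such a setup, server mode typically reports higher per-call latency than offline mode, since idle gaps leave less opportunity to hide launch overhead and keep the GPU warm, which is consistent with the larger speedups the paper reports in server mode.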