

Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

March 7, 2026
作者: Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, Shi-Min Hu
cs.AI

Abstract

Optimizing GPU kernels by hand is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is becoming a tangible reality. However, current LLM-driven automated optimization methods focus narrowly on machine learning applications, such as PyTorch operator optimization, while overlooking broader domains such as sparse matrix operations in scientific computing. Extending to these broader applications raises new challenges for both benchmarking and algorithm design, so developing a general-purpose automated kernel optimization method is our primary focus. In this paper, we address the absence of systematic multi-scenario evaluation by introducing MSKernelBench, a benchmark spanning four scenarios: fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines, each supporting both FP32 and BF16 precision. Building on this benchmark, we introduce CUDAMaster, a multi-agent, hardware-aware kernel optimization system that leverages profiling information and automatically constructs the full compilation and execution toolchain. Experimental results demonstrate that CUDAMaster achieves significant speedups across most operators, outperforming Astra by about 35%. In several cases, its performance matches or surpasses that of highly optimized, closed-source libraries such as cuBLAS. A demo showcasing the original and optimized code for each operator is available at https://hanyx2021.github.io/MSKernelBenchDemo/.
PDF (22) · March 16, 2026