Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts
March 7, 2026
Authors: Yuxuan Han, Meng-Hao Guo, Zhengning Liu, Wenguang Chen, Shi-Min Hu
cs.AI
Abstract
Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is gradually becoming a tangible reality. However, current LLM-driven automated optimization methods focus narrowly on machine learning applications, such as PyTorch operator optimization, while overlooking broader domains such as sparse matrix operations in scientific computing. Extending to these broader applications poses new challenges for both benchmark design and optimization algorithms. Developing a general-purpose automated kernel optimization method is therefore our primary focus. In this paper, we address the absence of systematic evaluation in multi-scenario settings by introducing MSKernelBench, a benchmark spanning four scenarios — fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines — each supporting both FP32 and BF16 precision. Building on this benchmark, we introduce CUDAMaster, a multi-agent, hardware-aware kernel optimization system that leverages profiling information and automatically constructs the full compilation and execution toolchain. Experimental results demonstrate that CUDAMaster achieves significant speedups across most operators, outperforming Astra by about 35%. In several cases, its performance matches or surpasses that of highly optimized, closed-source libraries such as cuBLAS. A demo showcasing the original and optimized code for each operator is available at https://hanyx2021.github.io/MSKernelBenchDemo/.