Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Abstract

Optimizing GPU kernels manually is a challenging and time-consuming task. With the rapid development of LLMs, automated GPU kernel optimization is becoming practical. However, current LLM-driven optimization methods focus narrowly on machine learning applications, such as PyTorch operator optimization, and overlook broader domains such as sparse matrix operations in scientific computing. Extending to these broader applications poses new challenges for both benchmarks and algorithms, so developing a general-purpose automated kernel optimization method is our primary focus. In this paper, we address the absence of systematic evaluation for multi-scenario settings by introducing MSKernelBench, a benchmark that spans fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines, each supporting both FP32 and BF16 precision. Building on this benchmark, we introduce CUDAMaster, a multi-agent, hardware-aware kernel optimization system that leverages profiling information and automatically constructs the full compilation and execution toolchain. Experimental results show that CUDAMaster achieves significant speedups across most operators, outperforming Astra by about 35%. In several cases, its performance matches or surpasses that of highly optimized, closed-source libraries such as cuBLAS. A demo showcasing the original and optimized code for each operator is available at https://hanyx2021.github.io/MSKernelBenchDemo/.
