Making LLMs Optimize Multi-Scenario CUDA Kernels Like Experts

Abstract

Optimizing GPU kernels manually is challenging and time-consuming. With the rapid development of LLMs, automated GPU kernel optimization is becoming practical. However, current LLM-driven optimization methods focus narrowly on machine learning applications, such as PyTorch operator optimization, and overlook broader domains such as sparse matrix operations in scientific computing. Extending to these domains poses new challenges for both benchmarks and algorithms, so our primary focus is a general-purpose automated kernel optimization method. In this paper, we address the absence of systematic evaluation for multi-scenario settings by introducing MSKernelBench, which spans fundamental algebraic operations, common LLM kernels, sparse matrix operators, and scientific computing routines, each supporting both FP32 and BF16 precision. Building on this benchmark, we introduce CUDAMaster, a multi-agent, hardware-aware system for kernel optimization that leverages profiling information and automatically constructs the full compilation and execution toolchain. Experimental results demonstrate that CUDAMaster achieves significant speedups on most operators, outperforming Astra by about 35%. In several cases, its performance matches or surpasses that of highly optimized, closed-source libraries such as cuBLAS. A demo showcasing the original and optimized code for each operator is available at https://hanyx2021.github.io/MSKernelBenchDemo/.
