

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

May 6, 2026
Authors: Han Wang, Jintao Zhang, Kai Jiang, Haoxu Wang, Jianfei Chen, Jun Zhu
cs.AI

Abstract

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, the compile rate rises from 52.3% to 68.8% while the average speedup declines from 1.58× to 1.44×; newly rescued kernels consistently underperform persistently correct ones (1.16× vs 1.58× speedup in rounds 0 to 1). Third, correctness does not imply efficiency. 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches 21.4×. Moreover, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing a systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX
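The abstract's headline metrics (per-category solve rates, mean speedup over the PyTorch eager baseline, and the share of correct-but-slower kernels) can be aggregated from per-task records. The sketch below is a conceptual illustration with hypothetical data, not the actual KernelBench-X harness; the record schema (`category`, `correct`, `speedup`) and the function `summarize` are assumptions for illustration, where `speedup` denotes eager_time / kernel_time.

```python
# Conceptual sketch of category-aware benchmark aggregation.
# NOTE: hypothetical schema and data -- not the KernelBench-X codebase.
from collections import defaultdict

def summarize(results):
    """Per category: solve rate, mean speedup over the eager baseline,
    and the fraction of correct kernels that are slower (speedup < 1.0)."""
    buckets = defaultdict(lambda: {"total": 0, "correct": 0, "speedups": []})
    for task in results:
        b = buckets[task["category"]]
        b["total"] += 1
        if task["correct"]:
            b["correct"] += 1
            b["speedups"].append(task["speedup"])

    summary = {}
    for cat, b in buckets.items():
        n_correct = b["correct"]
        n_slower = sum(1 for s in b["speedups"] if s < 1.0)
        summary[cat] = {
            "solve_rate": b["correct"] / b["total"],
            "mean_speedup": (sum(b["speedups"]) / n_correct) if n_correct else None,
            "frac_correct_but_slower": (n_slower / n_correct) if n_correct else None,
        }
    return summary

# Hypothetical records; 'speedup' = eager_time / kernel_time.
demo = [
    {"category": "Math",   "correct": True,  "speedup": 1.6},
    {"category": "Math",   "correct": True,  "speedup": 0.9},
    {"category": "Fusion", "correct": False, "speedup": None},
]
print(summarize(demo)["Math"]["solve_rate"])  # 1.0
```

Separating "correct" from "correct and faster than eager" in this way is what surfaces the paper's third finding: compilation and semantic checks alone say nothing about hardware efficiency.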