

KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

May 6, 2026
Authors: Han Wang, Jintao Zhang, Kai Jiang, Haoxu Wang, Jianfei Chen, Jun Zhu
cs.AI

Abstract

LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design: task category explains nearly three times more variance in semantic correctness than method (9.4% vs 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness but not performance: across GEAK iterations, the compile rate rises from 52.3% to 68.8% while the average speedup declines from 1.58× to 1.44×, and newly rescued kernels consistently underperform persistently correct ones (1.16× vs 1.58× speedup in rounds 0 to 1). Third, correctness does not imply efficiency: 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches 21.4×. Moreover, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing a systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX
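The abstract's core evaluation pattern, checking a generated kernel for numerical correctness against an eager reference before measuring its speedup, can be sketched in plain Python. This is an illustrative harness only: the function name `evaluate`, the tolerances, and the timing loop are assumptions for this sketch, not the paper's actual KernelBench-X harness, which runs Triton kernels on GPU hardware.

```python
import time

def evaluate(candidate, reference, inputs, rtol=1e-3, atol=1e-3, iters=100):
    """Illustrative correctness-then-efficiency check (not the paper's code).

    A candidate kernel is only timed if it first matches the eager
    reference within tolerance, mirroring the benchmark's observation
    that compiling (or even passing) does not imply efficiency.
    """
    ref_out = reference(*inputs)
    cand_out = candidate(*inputs)

    # Element-wise tolerance check, in the spirit of allclose.
    correct = all(
        abs(a - b) <= atol + rtol * abs(b)
        for a, b in zip(cand_out, ref_out)
    )
    if not correct:
        # An incorrect kernel gets no speedup number at all.
        return {"correct": False, "speedup": None}

    def mean_time(fn):
        start = time.perf_counter()
        for _ in range(iters):
            fn(*inputs)
        return (time.perf_counter() - start) / iters

    # Speedup > 1.0 means the candidate beats the eager baseline;
    # the paper reports 46.6% of correct kernels land below 1.0.
    return {"correct": True, "speedup": mean_time(reference) / mean_time(candidate)}
```

Running the reference against itself yields `correct=True` and a speedup near 1.0; substituting a kernel with a wrong numerical contract (e.g., mismatched rounding in a quantization task) would fail the tolerance check before any timing occurs.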
PDF · May 9, 2026