KernelBench-X: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

Abstract


LLM-based Triton kernel generation has attracted significant interest, yet a fundamental empirical question remains unanswered: where does this capability break down, and why? We present KernelBench-X, a benchmark designed to answer this question through category-aware evaluation of correctness and hardware efficiency across 176 tasks in 15 categories. Our systematic comparison of five representative methods yields three main findings. First, task structure determines correctness more than method design. Category explains nearly three times more variance in semantic correctness than method (9.4% vs. 3.3% explained deviance), and 72% of Fusion tasks fail across all five methods while Math tasks are solved consistently. Second, iterative refinement improves correctness, but not performance. Across GEAK iterations, compile rate rises from 52.3% to 68.8% while average speedup declines from 1.58× to 1.44×; newly rescued kernels consistently underperform persistently correct ones (1.16× vs. 1.58× speedup in round 0→1). Third, correctness does not imply efficiency. 46.6% of correct kernels are slower than the PyTorch eager baseline, and cross-hardware speedup variance reaches 21.4×. In addition, quantization remains completely unsolved (0/30 successes) despite non-trivial compilation rates, revealing a systematic misunderstanding of numerical computation contracts rather than surface-level syntax errors. These findings suggest that future progress depends on handling global coordination, explicitly modeling numerical precision, and incorporating hardware efficiency into generation. The code is available at https://github.com/BonnieW05/KernelBenchX.
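The second and third findings both rest on simple speedup arithmetic, which the following minimal sketch illustrates. It assumes speedup is defined as baseline time divided by kernel time and that the average is an arithmetic mean; the abstract states neither, and the benchmark may aggregate differently. The function names and all timings are hypothetical and are not data from the paper.

```python
from statistics import mean


def speedup(baseline_ms: float, kernel_ms: float) -> float:
    """Speedup over the baseline; a value above 1 means the kernel is faster.

    Assumed definition: baseline time / kernel time.
    """
    return baseline_ms / kernel_ms


def summarize(speedups):
    """Return (mean speedup, fraction of kernels slower than the baseline)."""
    slower = sum(1 for s in speedups if s < 1.0)
    return mean(speedups), slower / len(speedups)


# Hypothetical (baseline_ms, kernel_ms) timings, invented for illustration.
# "persistent": kernels that were already correct before refinement.
persistent = [speedup(2.0, 1.0), speedup(3.0, 2.0), speedup(1.0, 0.8)]
# "rescued": kernels that only became correct after a refinement round.
rescued = [speedup(1.0, 1.25), speedup(2.0, 2.0)]

mean_before, slow_before = summarize(persistent)
mean_after, slow_after = summarize(persistent + rescued)

print(f"before: mean {mean_before:.2f}x, slower than baseline {slow_before:.0%}")
print(f"after:  mean {mean_after:.2f}x, slower than baseline {slow_after:.0%}")
```

In this toy setting, adding the rescued kernels raises the number of correct kernels while pulling the mean speedup down, which is the same qualitative pattern the abstract reports for GEAK iterations.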
