AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

Abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations. Yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none includes both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. It evaluates complete agent workflows in isolated workspaces using three components: gated compilation, correctness, and performance checks; centralized scoring; and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, whereas PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
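To make the gated checks and the unseen-configuration protocol concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual implementation: the names `GateResult` and `evaluate_gated`, the callback signatures, and the use of an arithmetic mean for speedup are all hypothetical. The toy "kernel" is only correct for sizes that are multiples of 1024, imitating a hardcoded shape assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class GateResult:
    """Outcome of one gated evaluation (hypothetical structure)."""
    compiled: bool
    correct: bool
    speedup: Optional[float]  # None unless both earlier gates pass


def evaluate_gated(
    compile_ok: Callable[[], bool],
    is_correct: Callable[[int], bool],
    measure_speedup: Callable[[int], float],
    configs: Sequence[int],
) -> GateResult:
    """Run compile -> correctness -> performance, stopping at the first failure."""
    if not configs:
        raise ValueError("at least one input configuration is required")
    # Gate 1: a kernel that does not compile is never checked further.
    if not compile_ok():
        return GateResult(False, False, None)
    # Gate 2: every configuration must produce correct output.
    if not all(is_correct(cfg) for cfg in configs):
        return GateResult(True, False, None)
    # Gate 3: performance is measured only for correct kernels.
    # Assumption: arithmetic mean over configurations.
    speedups = [measure_speedup(cfg) for cfg in configs]
    return GateResult(True, True, sum(speedups) / len(speedups))


# Toy stand-ins for a real build, correctness check, and timing harness.
def toy_compile() -> bool:
    return True


def toy_is_correct(n: int) -> bool:
    # Imitates a kernel that silently assumes n is a multiple of 1024.
    return n % 1024 == 0


def toy_speedup(n: int) -> float:
    return 2.0


seen_configs = [1024]          # shapes the agent observed while optimizing
unseen_configs = [1000, 4096]  # held-out shapes, evaluated afterwards

seen_result = evaluate_gated(toy_compile, toy_is_correct, toy_speedup, seen_configs)
unseen_result = evaluate_gated(toy_compile, toy_is_correct, toy_speedup, unseen_configs)
print(seen_result)
print(unseen_result)
```

In this sketch the kernel passes all three gates on the seen configuration but fails the correctness gate on the held-out shapes, which is the kind of failure the abstract attributes to PyTorch-to-HIP tasks.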