AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

Abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for evaluating AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. It evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring, and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
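To illustrate what gated compilation, correctness, and performance checks could look like, the following is a minimal Python sketch. It is not taken from the AgentKernelArena codebase: the names `StageResult`, `gated_speedup`, and `summarize`, and the aggregation rule, are assumptions made for illustration. The one idea it borrows from the abstract is that a speedup is credited only when the earlier gates pass, and that the same summary can be computed separately for seen and unseen input configurations.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageResult:
    """Outcome of one task on one input configuration (hypothetical record)."""
    compiled: bool
    correct: bool
    baseline_ms: float
    optimized_ms: float


def gated_speedup(r: StageResult) -> Optional[float]:
    """Return the speedup only if the compile and correctness gates pass."""
    if not r.compiled:
        return None  # gate 1: the kernel must compile
    if not r.correct:
        return None  # gate 2: the output must match the reference
    return r.baseline_ms / r.optimized_ms  # gate 3: performance


def summarize(results: List[StageResult]) -> dict:
    """Aggregate pass rates and mean speedup over a non-empty result list."""
    n = len(results)
    speedups = [s for s in map(gated_speedup, results) if s is not None]
    return {
        "compile_rate": sum(r.compiled for r in results) / n,
        "correct_rate": sum(r.compiled and r.correct for r in results) / n,
        "mean_speedup": sum(speedups) / len(speedups) if speedups else None,
    }


# Made-up timings: the same two kernels on seen and unseen input shapes.
seen = [StageResult(True, True, 10.0, 5.0), StageResult(True, True, 8.0, 2.0)]
unseen = [StageResult(True, True, 10.0, 5.0), StageResult(True, False, 8.0, 2.0)]
```

Comparing `summarize(seen)` with `summarize(unseen)` shows the kind of gap the generalization protocol is meant to expose: a kernel that hardcodes a shape passes on the seen configuration but fails the correctness gate on the unseen one, so its speedup is not counted there.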