AgentKernelArena: A Generalization-Aware Benchmark for GPU Kernel Optimization Agents

Abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations. Existing kernel benchmarks, however, evaluate single LLM calls rather than full agent workflows, and none includes both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. It evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks; centralized scoring; and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation rates and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
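The gated checks and the unseen-configuration comparison described in the abstract can be illustrated with a minimal Python sketch. This is not AgentKernelArena's actual implementation: the names `StageResult`, `gated_speedup`, `summarize`, and `correctness_drop` are hypothetical, and the scoring rules (a failed gate yields a speedup of 0.0, and the mean speedup is taken only over tasks that pass every gate) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StageResult:
    """Outcome of one task run (hypothetical record layout)."""
    compiled: bool
    correct: bool
    baseline_ms: Optional[float] = None   # reference kernel time
    optimized_ms: Optional[float] = None  # agent-produced kernel time


def gated_speedup(r: StageResult) -> float:
    """Speedup if every gate passes, else 0.0 (assumed convention)."""
    if not r.compiled:          # gate 1: compilation
        return 0.0
    if not r.correct:           # gate 2: correctness
        return 0.0
    if not r.baseline_ms or not r.optimized_ms:  # gate 3: valid timings
        return 0.0
    return r.baseline_ms / r.optimized_ms


def summarize(results: List[StageResult]) -> dict:
    """Aggregate compile rate, correctness rate, and mean speedup."""
    n = len(results)
    speedups = [s for s in map(gated_speedup, results) if s > 0.0]
    return {
        "compile_rate": sum(r.compiled for r in results) / n,
        "correct_rate": sum(r.compiled and r.correct for r in results) / n,
        # Assumption: averaged over tasks that passed all gates.
        "mean_speedup": sum(speedups) / len(speedups) if speedups else 0.0,
    }


def correctness_drop(seen: List[StageResult],
                     unseen: List[StageResult]) -> float:
    """Correctness rate on seen configs minus the rate on unseen configs."""
    return summarize(seen)["correct_rate"] - summarize(unseen)["correct_rate"]


if __name__ == "__main__":
    ok = StageResult(True, True, baseline_ms=10.0, optimized_ms=5.0)
    wrong = StageResult(True, False)
    print(summarize([ok, wrong]))
    print(correctness_drop(seen=[ok, ok], unseen=[ok, wrong]))
```

Under these assumptions, a positive `correctness_drop` corresponds to the pattern the abstract reports for PyTorch-to-HIP: kernels that pass on the shapes the agent observed but fail on shapes it never saw.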