AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

Abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations. Existing kernel benchmarks, however, evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation. It evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks; centralized scoring; and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
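To make the gated evaluation and the unseen-configuration comparison concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual scoring code: the `ConfigResult` schema, the function names, and the choice to report speedup only for results that pass both earlier gates are all hypothetical, and the timings in the example are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigResult:
    """One candidate kernel run on one input configuration (hypothetical schema)."""
    compiled: bool
    correct: bool
    baseline_ms: float
    candidate_ms: float


def gated_speedup(r: ConfigResult) -> Optional[float]:
    """Return a speedup only when the compilation and correctness gates both pass."""
    if not r.compiled:
        return None  # gate 1: the kernel must compile
    if not r.correct:
        return None  # gate 2: outputs must match the reference
    return r.baseline_ms / r.candidate_ms  # gate 3: performance is measured last


def summarize(results: list[ConfigResult]) -> dict:
    """Aggregate compile rate, correctness rate, and mean gated speedup."""
    n = len(results)
    speedups = [s for s in map(gated_speedup, results) if s is not None]
    return {
        "compile_rate": sum(r.compiled for r in results) / n,
        "correct_rate": sum(r.compiled and r.correct for r in results) / n,
        "mean_speedup": sum(speedups) / len(speedups) if speedups else None,
    }


def correctness_drop(seen: list[ConfigResult], unseen: list[ConfigResult]) -> float:
    """Difference in correctness rate between seen and unseen configurations."""
    return summarize(seen)["correct_rate"] - summarize(unseen)["correct_rate"]


# Made-up timings: the kernel is correct on both seen shapes but fails on one unseen shape.
seen = [ConfigResult(True, True, 10.0, 5.0), ConfigResult(True, True, 8.0, 4.0)]
unseen = [ConfigResult(True, True, 10.0, 5.0), ConfigResult(True, False, 8.0, 1.0)]

seen_summary = summarize(seen)
unseen_summary = summarize(unseen)
drop = correctness_drop(seen, unseen)
```

In this sketch, a kernel that runs fast but produces wrong output on an unseen shape contributes nothing to the mean speedup and lowers the correctness rate. This mirrors the kind of gap the abstract reports for PyTorch-to-HIP tasks.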