AccelOpt: A Self-Improving LLM Agentic System for AI Accelerator Kernel Optimization

April 15, 2026
Authors: Genghan Zhang, Shaowei Zhu, Anjiang Wei, Zhenyu Song, Allen Nie, Zhen Jia, Nandita Vijaykumar, Yida Wang, Kunle Olukotun
cs.AI

Abstract

We present AccelOpt, a self-improving large language model (LLM) agentic system that autonomously optimizes kernels for emerging AI accelerators, eliminating the need for expert-provided hardware-specific optimization knowledge. AccelOpt explores the kernel optimization space through iterative generation, informed by an optimization memory that curates experiences and insights from previously encountered slow-fast kernel pairs. To evaluate AccelOpt, we build NKIBench, a new benchmark suite of AWS Trainium accelerator kernels of varying complexity, extracted from real-world LLM workloads. Our evaluation confirms that AccelOpt's capability improves over time, boosting the average percentage of peak throughput from 49% to 61% on Trainium 1 and from 45% to 59% on Trainium 2 for NKIBench kernels. Moreover, AccelOpt is highly cost-effective: using open-source models, it matches the kernel improvements of Claude Sonnet 4 while being 26× cheaper. The code is open-sourced at https://github.com/zhang677/AccelOpt.
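
The idea of iterative generation guided by an optimization memory can be illustrated with a toy loop. The sketch below is not AccelOpt's implementation: the `Kernel` type, the `measure_latency` cost model, the `propose` function (a random stand-in for an LLM call), and all parameter values are assumptions made purely for illustration. It shows only the general pattern of generating candidates, recording slow-fast pairs, and letting those pairs bias later proposals.

```python
import random
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Kernel:
    tile: int  # hypothetical tuning knob standing in for real kernel source code


def measure_latency(k: Kernel) -> float:
    # Toy cost model (assumption): latency is lowest when tile == 64.
    return 1.0 + abs(k.tile - 64) / 64


@dataclass
class OptimizationMemory:
    capacity: int = 8
    pairs: list = field(default_factory=list)  # entries are (slow, fast, speedup)

    def record(self, slow: Kernel, fast: Kernel) -> None:
        # Keep only pairs where the candidate is actually faster.
        speedup = measure_latency(slow) / measure_latency(fast)
        if speedup > 1.0:
            self.pairs.append((slow, fast, speedup))
            self.pairs.sort(key=lambda p: p[2], reverse=True)
            del self.pairs[self.capacity:]  # retain the highest-speedup pairs

    def hint(self) -> int:
        # +1 if growing the tile helped more often, -1 if shrinking did, 0 if unknown.
        score = sum(1 if fast.tile > slow.tile else -1 for slow, fast, _ in self.pairs)
        return (score > 0) - (score < 0)


def propose(k: Kernel, memory: OptimizationMemory, rng: random.Random) -> Kernel:
    # Stand-in for an LLM proposal: usually follow the memory's hint, sometimes explore.
    direction = memory.hint()
    if direction == 0 or rng.random() < 0.25:
        direction = rng.choice([-1, 1])
    new_tile = k.tile * 2 if direction > 0 else max(1, k.tile // 2)
    return Kernel(tile=new_tile)


def optimize(start: Kernel, iterations: int = 6, samples: int = 4, seed: int = 0):
    rng = random.Random(seed)
    memory = OptimizationMemory()
    best = start
    for _ in range(iterations):
        candidates = [propose(best, memory, rng) for _ in range(samples)]
        for cand in candidates:
            memory.record(best, cand)
        fastest = min(candidates, key=measure_latency)
        if measure_latency(fastest) < measure_latency(best):
            best = fastest  # accept only strict improvements
    return best, memory


if __name__ == "__main__":
    best, memory = optimize(Kernel(tile=8))
    print(best, round(measure_latency(best), 3), len(memory.pairs))
```

Because a candidate replaces the current best only when it is strictly faster, the best latency never gets worse across iterations. In a real system the proposal step would be an LLM prompted with the stored pairs, and latency would come from profiling on the target hardware.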