CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
February 27, 2026
Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
cs.AI
Abstract
GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but neither paradigm fundamentally improves the model's intrinsic CUDA optimization ability, so performance gains remain limited. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline; a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals; and reinforcement learning algorithmic techniques that enable stable training. CUDA Agent achieves state-of-the-art results on KernelBench, with faster rates of 100%, 100%, and 92% over torch.compile on the Level-1, Level-2, and Level-3 splits, and outperforms the strongest proprietary models, such as Claude Opus 4.5 and Gemini 3 Pro, by about 40% on the hardest Level-3 setting.
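To make the headline metric concrete, here is a minimal sketch of how a correctness-gated "faster rate" could be computed from per-task timings. The abstract does not define the metric, so the sketch assumes it means the fraction of tasks whose generated kernel is both correct and strictly faster than the baseline. The names `KernelResult` and `faster_rate` are hypothetical and are not taken from the paper or from KernelBench, and the timings are illustrative only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class KernelResult:
    """Outcome of one benchmark task (hypothetical record, not from the paper)."""
    correct: bool        # generated kernel matched the reference output
    kernel_ms: float     # measured runtime of the generated kernel
    baseline_ms: float   # measured runtime of the baseline (e.g. torch.compile)


def faster_rate(results: Sequence[KernelResult]) -> float:
    """Fraction of tasks where the kernel is correct AND beats the baseline.

    Assumed definition: an incorrect kernel never counts as faster,
    regardless of its measured runtime.
    """
    if not results:
        raise ValueError("need at least one result")
    wins = sum(1 for r in results if r.correct and r.kernel_ms < r.baseline_ms)
    return wins / len(results)


# Illustrative numbers only; these are not measurements from the paper.
example = [
    KernelResult(correct=True, kernel_ms=0.8, baseline_ms=1.0),   # correct and faster
    KernelResult(correct=True, kernel_ms=1.2, baseline_ms=1.0),   # correct but slower
    KernelResult(correct=False, kernel_ms=0.1, baseline_ms=1.0),  # fast but wrong
    KernelResult(correct=True, kernel_ms=0.5, baseline_ms=1.0),   # correct and faster
]
rate = faster_rate(example)
```

Gating on correctness before comparing runtimes reflects the abstract's emphasis on automated verification as a source of reliable reward signals: under this assumed definition, a kernel that is fast but produces wrong output contributes nothing to the score.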