CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

Abstract

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques that enable stable training. CUDA Agent achieves state-of-the-art results on KernelBench, delivering faster rates of 100%, 100%, and 92% over torch.compile on the Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
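
The abstract reports a "faster rate" over torch.compile without defining it. The sketch below shows one plausible reading, assumed here rather than taken from the text: the fraction of benchmark tasks on which a generated kernel runs strictly faster than the baseline. The function name `faster_rate` and the timing values are hypothetical and purely illustrative; they are not results from KernelBench.

```python
def faster_rate(candidate_ms, baseline_ms):
    """Fraction of tasks where the candidate is strictly faster than the baseline.

    Assumption: "faster rate" means a per-task win rate. The abstract does not
    spell out the definition, so treat this as an illustrative interpretation.
    """
    if len(candidate_ms) != len(baseline_ms):
        raise ValueError("timing lists must have the same length")
    if not candidate_ms:
        raise ValueError("at least one task is required")
    # A task counts as a win only if the candidate runtime is strictly lower.
    wins = sum(c < b for c, b in zip(candidate_ms, baseline_ms))
    return wins / len(candidate_ms)


# Made-up runtimes in milliseconds for four hypothetical tasks.
candidate = [0.8, 1.2, 0.5, 2.0]
baseline = [1.0, 1.0, 0.9, 2.5]
print(f"faster rate: {faster_rate(candidate, baseline):.0%}")  # prints "faster rate: 75%"
```

Under this reading, a faster rate of 100% would mean every task in the split beat the baseline, which says nothing by itself about the size of the speedup on each task.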