

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

February 27, 2026
Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
cs.AI

Abstract

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline; a skill-augmented CUDA development environment with automated verification and profiling that provides reliable reward signals; and reinforcement learning algorithmic techniques that enable stable training. CUDA Agent achieves state-of-the-art results on KernelBench, running 100%, 100%, and 92% faster than torch.compile on the Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models, such as Claude Opus 4.5 and Gemini 3 Pro, by about 40% on the hardest Level-3 setting.
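The verification-then-profiling reward design mentioned in the abstract can be illustrated with a minimal sketch. All names here (`evaluate_kernel`, `bench`) are hypothetical and the paper's actual reward shaping is not specified; the point is only the gating structure: a candidate kernel is first checked for numerical agreement with a reference implementation, and only a correct kernel is then scored by its measured speedup, so fast-but-wrong code never earns a positive reward.

```python
import time

def evaluate_kernel(candidate, reference, inputs, atol=1e-5, reps=100):
    """Hypothetical RL reward: 0.0 if the candidate's output deviates from
    the reference beyond `atol`, otherwise the measured speedup ratio."""
    ref_out = reference(*inputs)
    cand_out = candidate(*inputs)
    # Verification gate: any numerical mismatch yields zero reward.
    if len(ref_out) != len(cand_out) or any(
        abs(a - b) > atol for a, b in zip(ref_out, cand_out)
    ):
        return 0.0

    # Profiling: average wall-clock time over `reps` runs for each version.
    def bench(fn):
        start = time.perf_counter()
        for _ in range(reps):
            fn(*inputs)
        return (time.perf_counter() - start) / reps

    # Reward is the speedup of the candidate over the reference.
    return bench(reference) / bench(candidate)

# Stand-ins for kernels (plain Python callables instead of real CUDA code):
double_ref = lambda xs: [x * 2 for x in xs]
double_ok = lambda xs: [x + x for x in xs]       # correct -> positive reward
double_bad = lambda xs: [x * 2 + 1 for x in xs]  # incorrect -> reward 0.0
```

In the real environment the "kernels" would be compiled CUDA sources run against PyTorch reference modules, but the reward structure — correctness as a hard gate, speedup as the scalar signal — is the same shape.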
March 7, 2026