

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

February 27, 2026
Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
cs.AI

Abstract

GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline; a skill-augmented CUDA development environment with automated verification and profiling that provides reliable reward signals; and reinforcement learning algorithmic techniques that enable stable training. CUDA Agent achieves state-of-the-art results on KernelBench, running 100%, 100%, and 92% faster than torch.compile on the Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models, such as Claude Opus 4.5 and Gemini 3 Pro, by about 40% on the hardest Level-3 setting.
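The verification-then-profiling reward design mentioned in the abstract can be illustrated with a minimal sketch. All names here (`evaluate_kernel`, `bench`) are hypothetical and the paper's actual reward shaping is not specified; the point is only the gating structure: a candidate kernel is first checked for numerical agreement with a reference implementation, and only a correct kernel is then scored by its measured speedup, so fast-but-wrong code never earns a positive reward.

```python
import time

def evaluate_kernel(candidate, reference, inputs, atol=1e-5, reps=100):
    """Hypothetical RL reward: 0.0 if the candidate's output deviates from
    the reference beyond `atol`, otherwise the measured speedup ratio."""
    ref_out = reference(*inputs)
    cand_out = candidate(*inputs)
    # Verification gate: any numerical mismatch yields zero reward.
    if len(ref_out) != len(cand_out) or any(
        abs(a - b) > atol for a, b in zip(ref_out, cand_out)
    ):
        return 0.0

    # Profiling: average wall-clock time over `reps` runs for each version.
    def bench(fn):
        start = time.perf_counter()
        for _ in range(reps):
            fn(*inputs)
        return (time.perf_counter() - start) / reps

    # Reward is the speedup of the candidate over the reference.
    return bench(reference) / bench(candidate)

# Stand-ins for kernels (plain Python callables instead of real CUDA code):
double_ref = lambda xs: [x * 2 for x in xs]
double_ok = lambda xs: [x + x for x in xs]       # correct -> positive reward
double_bad = lambda xs: [x * 2 + 1 for x in xs]  # incorrect -> reward 0.0
```

In the real environment the "kernels" would be compiled CUDA sources run against PyTorch reference modules, but the reward structure — correctness as a hard gate, speedup as the scalar signal — is the same shape.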
March 7, 2026