

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

July 18, 2025
Authors: Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum
cs.AI

Abstract

The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g., R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model also demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20, despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) it discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) it uncovers fundamental principles of CUDA optimization; 3) it identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends the acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations and holds promise for substantially improving GPU efficiency and alleviating the rising pressure on GPU computing resources.
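The abstract states that the only training signal is the measured speedup of a generated kernel over its reference implementation. As a rough illustration only (not code from the paper), the sketch below shows what such a speedup-based reward might look like in a KernelBench-style PyTorch setting; the names speedup_reward, ref_fn, and candidate_fn are hypothetical, and the snippet assumes a CUDA device is available. A candidate earns a reward only if its output matches the reference, and the reward is the ratio of the reference runtime to the candidate runtime.

```python
import time
import torch

def speedup_reward(ref_fn, candidate_fn, inputs, n_warmup=3, n_trials=10):
    """Hypothetical speedup-based reward: reference runtime / candidate runtime.

    Returns 0.0 when the candidate's output disagrees with the reference,
    so only correct kernels receive a positive reward.
    """
    ref_out = ref_fn(*inputs)
    cand_out = candidate_fn(*inputs)
    if not torch.allclose(ref_out, cand_out, rtol=1e-3, atol=1e-3):
        return 0.0  # incorrect kernels earn no reward

    def mean_runtime(fn):
        for _ in range(n_warmup):        # warm-up runs to exclude startup/cache effects
            fn(*inputs)
        torch.cuda.synchronize()         # make sure prior GPU work has finished
        start = time.perf_counter()
        for _ in range(n_trials):
            fn(*inputs)
        torch.cuda.synchronize()         # wait for all timed kernels to complete
        return (time.perf_counter() - start) / n_trials

    return mean_runtime(ref_fn) / mean_runtime(candidate_fn)  # >1.0 means the candidate is faster

# Example usage with a toy "reference" and "candidate" pair:
x = torch.randn(1 << 20, device="cuda")
baseline = lambda t: torch.relu(t) * 2.0
candidate = lambda t: torch.clamp(t, min=0.0) * 2.0
print(speedup_reward(baseline, candidate, (x,)))
```

In the paper's framing, this scalar is the entire supervision: the RL procedure needs no human-written optimization labels, only correctness-gated runtime measurements of the kernels the model proposes.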