

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

July 18, 2025
Authors: Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum
cs.AI

Abstract

The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models, has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current state-of-the-art models (e.g., R1, o1) achieve low success rates at improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x17.7 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x449. Furthermore, the model demonstrates excellent portability across GPU architectures, achieving average speedups of x17.8 on H100, x19.0 on RTX 3090, x16.5 on L40, x14.7 on H800, and x13.9 on H20, despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several remarkable properties: 1) it discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) it uncovers fundamental principles of CUDA optimization; 3) it identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. These capabilities demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends its acquired reasoning abilities to new kernels. This paradigm opens up possibilities for the automated optimization of CUDA operations and holds promise for substantially improving GPU efficiency and alleviating the rising pressure on GPU computing resources.
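The abstract states that training relies on speedup-based reward signals alone. As a rough illustration only (not the paper's actual implementation), the sketch below shows how such a reward could be computed by timing a candidate kernel against a reference on the same inputs; the helper names (`time_kernel`, `speedup_reward`) and the zero-reward rule for incorrect outputs are assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a speedup-based reward signal for CUDA kernel optimization.
# Assumes PyTorch with a CUDA device; function names and the reward definition are illustrative.
import time
import torch


def time_kernel(run_fn, warmup: int = 3, iters: int = 20) -> float:
    """Return the average wall-clock time (seconds) of a CUDA callable."""
    for _ in range(warmup):
        run_fn()
    torch.cuda.synchronize()  # make sure warmup work has finished
    start = time.perf_counter()
    for _ in range(iters):
        run_fn()
    torch.cuda.synchronize()  # wait for all timed launches to complete
    return (time.perf_counter() - start) / iters


def speedup_reward(reference_fn, candidate_fn, outputs_match: bool) -> float:
    """Reward = speedup of the candidate over the reference; 0 if the candidate is incorrect."""
    if not outputs_match:
        return 0.0  # assumed rule: incorrect kernels earn no reward
    t_ref = time_kernel(reference_fn)
    t_new = time_kernel(candidate_fn)
    return t_ref / t_new  # values > 1.0 mean the candidate kernel is faster
```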