CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Abstract

The exponential growth in demand for GPU computing resources, driven by the rapid advancement of Large Language Models (LLMs), has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models (e.g. R1, o1) achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization. CUDA-L1 improves performance on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of 17.7× across all 250 CUDA kernels of KernelBench, with peak speedups reaching 449×. The model also ports well across GPU architectures, achieving average speedups of 17.8× on H100, 19.0× on RTX 3090, 16.5× on L40, 14.7× on H800, and 13.9× on H20 despite being optimized specifically for A100. Beyond these benchmark results, CUDA-L1 demonstrates several notable properties: 1) it discovers a variety of CUDA optimization techniques and learns to combine them strategically to achieve optimal performance; 2) it uncovers fundamental principles of CUDA optimization; 3) it identifies non-obvious performance bottlenecks and rejects seemingly beneficial optimizations that harm performance. The capabilities of CUDA-L1 demonstrate that reinforcement learning can transform an initially poor-performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. More importantly, the trained RL model extends the acquired reasoning abilities to new kernels. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially improve GPU efficiency and alleviate the rising pressure on GPU computing resources.
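To make the idea of a speedup-based reward concrete, the following is a minimal sketch and not the paper's actual reward function. The function names (`speedup`, `speedup_reward`, `average_speedup`), the zero reward for incorrect kernels, and the use of a plain arithmetic mean are all assumptions made for illustration; the abstract states only that the reward signal is based on speedup.

```python
from statistics import mean


def speedup(reference_seconds: float, candidate_seconds: float) -> float:
    """Ratio of reference runtime to candidate runtime (> 1 means faster)."""
    if reference_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return reference_seconds / candidate_seconds


def speedup_reward(reference_seconds: float,
                   candidate_seconds: float,
                   is_correct: bool) -> float:
    """Hypothetical reward: the measured speedup, gated on correctness."""
    # Assumption: a kernel that produces wrong output earns no reward,
    # regardless of how fast it runs.
    if not is_correct:
        return 0.0
    return speedup(reference_seconds, candidate_seconds)


def average_speedup(measurements) -> float:
    """Arithmetic mean of per-kernel speedups over (reference, candidate) pairs."""
    return mean(speedup(ref, cand) for ref, cand in measurements)


if __name__ == "__main__":
    # Made-up timings for three kernels: (reference_seconds, candidate_seconds).
    timings = [(2.0, 1.0), (3.0, 0.5), (1.0, 1.0)]
    print(average_speedup(timings))
```

An arithmetic mean is dominated by a few very large speedups, so an average figure and a peak figure reported together describe different aspects of the same distribution.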

