AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
July 8, 2025
Authors: Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun
cs.AI
Abstract
Kernel development in deep learning requires optimizing computational units
across hardware while balancing memory management, parallelism, and
hardware-specific optimizations through extensive empirical tuning. Although
domain-specific languages like Triton simplify GPU programming by abstracting
low-level details, developers must still manually tune critical parameters such
as tile sizes and memory access patterns through iterative experimentation,
creating substantial barriers to optimal performance and wider adoption. In
this work, we introduce AutoTriton, the first model dedicated to Triton
programming powered by reinforcement learning (RL). AutoTriton first performs
supervised fine-tuning (SFT) on data collected by a high-quality data-gathering
pipeline to acquire essential Triton programming expertise, and then conducts
RL with the Group Relative Policy Optimization (GRPO) algorithm, combining a
rule-based reward with an execution-based reward to further improve its Triton
programming ability. Experiments across five evaluation channels of
TritonBench and KernelBench show that our 8B model AutoTriton achieves
performance comparable to mainstream large models, including Claude-4-Sonnet
and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial
role of each module within AutoTriton, including the SFT stage, the RL stage,
and the reward design strategy. These findings underscore the promise of RL for
automatically generating high-performance kernels, which are core components of
AI systems, and thus establish an important foundation for building more
efficient AI systems. The model and code
will be available at https://github.com/AI9Stars/AutoTriton.
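The abstract does not spell out the exact reward definitions, so the following is only a minimal sketch of how a rule-based reward and an execution-based reward could be combined when scoring generated Triton kernels for RL. The function names (`rule_reward`, `execution_reward`, `combined_reward`), the weights, and the `entry_point` launch-function convention are illustrative assumptions, not the paper's implementation.

```python
# Sketch only: a combined rule-based + execution-based reward for generated
# Triton kernels, under assumed conventions (not AutoTriton's actual code).
import ast

import torch


def rule_reward(candidate_src: str) -> float:
    """Cheap static check: the code must parse and define a @triton.jit kernel."""
    try:
        tree = ast.parse(candidate_src)
    except SyntaxError:
        return 0.0
    has_jit_kernel = any(
        isinstance(node, ast.FunctionDef)
        and any("triton" in ast.dump(dec) and "jit" in ast.dump(dec)
                for dec in node.decorator_list)
        for node in ast.walk(tree)
    )
    return 1.0 if has_jit_kernel else 0.0


def execution_reward(candidate_src: str, entry_point: str,
                     inputs: tuple, reference: torch.Tensor) -> float:
    """Run the generated launch function and compare its output to a reference."""
    namespace: dict = {}
    try:
        exec(compile(candidate_src, "<candidate>", "exec"), namespace)
        output = namespace[entry_point](*inputs)
        if torch.allclose(output, reference, rtol=1e-3, atol=1e-3):
            return 1.0
    except Exception:
        pass  # compile/launch/correctness failures all yield zero reward
    return 0.0


def combined_reward(candidate_src: str, entry_point: str,
                    inputs: tuple, reference: torch.Tensor,
                    w_rule: float = 0.1, w_exec: float = 0.9) -> float:
    """Weighted sum of the two signals (weights are illustrative)."""
    r_rule = rule_reward(candidate_src)
    # Only pay the cost of execution if the static check passes.
    r_exec = (execution_reward(candidate_src, entry_point, inputs, reference)
              if r_rule > 0 else 0.0)
    return w_rule * r_rule + w_exec * r_exec
```

In a GRPO training loop, a scalar reward like this would be computed for each sampled completion in a group and normalized within the group to obtain relative advantages for the policy update.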