
Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

February 5, 2026
Authors: Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He
cs.AI

Abstract

High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization: models may hack the training reward and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy-gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO), which provides unbiased advantage estimation for multi-turn RL. We further incorporate mismatch correction to stabilize training, and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to alleviate lazy optimization. The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
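To make the self-inclusion issue concrete, below is a minimal single-turn sketch in Python (NumPy). It is an illustration under our own assumptions, not the paper's implementation: the function names are invented here, GRPO's standard-deviation normalization is omitted, and TRLOO applies the leave-one-out baseline at the turn level, which this single-turn sketch does not model.

    import numpy as np

    def group_mean_advantages(rewards):
        # GRPO-style baseline: the group mean includes each sample's own
        # reward, so the baseline is correlated with the sample it scores.
        # (GRPO additionally divides by the group std; omitted here.)
        r = np.asarray(rewards, dtype=float)
        return r - r.mean()

    def leave_one_out_advantages(rewards):
        # RLOO-style baseline: each sample is compared against the mean
        # of the OTHER samples only, decoupling baseline and sample.
        r = np.asarray(rewards, dtype=float)
        loo_mean = (r.sum() - r) / (len(r) - 1)
        return r - loo_mean

    rewards = [1.0, 0.0, 0.0, 0.0]  # e.g., one correct kernel out of four rollouts
    print(group_mean_advantages(rewards))     # [ 0.75 -0.25 -0.25 -0.25]
    print(leave_one_out_advantages(rewards))  # [ 1.   -0.333 -0.333 -0.333] (approx.)

With rewards [1, 0, 0, 0], the group-mean baseline damps the correct sample's advantage to 0.75, because that sample's own reward inflates its baseline; the leave-one-out baseline is independent of the sample being scored, which is what keeps the resulting REINFORCE-style gradient estimate unbiased.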