Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
February 5, 2026
Authors: Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He
cs.AI
Abstract
High-quality kernel code is critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the training process is vulnerable to reward hacking and lazy optimization: models may game the training reward and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy-gradient issue caused by self-inclusion in GRPO. To address this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO), which provides unbiased advantage estimation for multi-turn RL. We further incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to alleviate lazy optimization. The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
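As a rough illustration of the bias the abstract points to (a minimal sketch with hypothetical names, not the paper's exact turn-level formulation): GRPO baselines each sampled rollout's reward against a group mean that includes the rollout itself, whereas a leave-one-out baseline in the spirit of TRLOO excludes it, which keeps the REINFORCE-style gradient estimate unbiased.

```python
import numpy as np

def grpo_style_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO-style: baseline is the group mean, which includes the sample
    # itself (self-inclusion). GRPO additionally normalizes by the group
    # std; omitted here to keep the contrast minimal.
    return rewards - rewards.mean()

def leave_one_out_advantages(rewards: np.ndarray) -> np.ndarray:
    # Leave-one-out: baseline each reward against the mean of the OTHER
    # samples in the group, so the baseline is independent of the sample
    # being scored and the policy-gradient estimate stays unbiased.
    n = rewards.shape[0]
    loo_baseline = (rewards.sum() - rewards) / (n - 1)
    return rewards - loo_baseline

# Toy group of 4 rollouts sampled for the same kernel-generation prompt.
rewards = np.array([1.0, 0.0, 0.0, 2.0])
print(grpo_style_advantages(rewards))
print(leave_one_out_advantages(rewards))
```

In a multi-turn setting the same subtraction would be applied per turn rather than per trajectory; how TRLOO assigns rewards to individual turns is detailed in the paper, not in this sketch.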