Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Abstract

High-quality kernels are critical for scalable AI systems, and enabling large language models (LLMs) to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization: models may game the training reward and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO), which provides unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When the best candidate is selected across all turns, this 1.2x speedup rate rises further to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
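The self-inclusion issue mentioned in the abstract can be illustrated with a small sketch. It compares a group-mean baseline, where each sample's own reward contributes to its baseline, with a leave-one-out baseline computed from the other samples only. This is a generic illustration, not the paper's implementation: the function names and the `rewards_by_turn` layout are hypothetical, the standard-deviation normalization used in GRPO is omitted, and applying the leave-one-out baseline independently at each turn is only one plausible reading of "turn-level".

```python
from statistics import mean


def group_mean_advantages(rewards):
    """Advantage with a group-mean baseline (each sample is included in its own baseline)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def leave_one_out_advantages(rewards):
    """Advantage where sample i's baseline is the mean reward of the other samples."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def turn_level_loo(rewards_by_turn):
    """Apply the leave-one-out baseline separately at each turn.

    rewards_by_turn[t][i] is the reward of rollout i at turn t
    (a hypothetical layout chosen for this sketch).
    """
    return [leave_one_out_advantages(turn) for turn in rewards_by_turn]


if __name__ == "__main__":
    rewards = [1.0, 0.0, 0.0, 2.0]
    print(group_mean_advantages(rewards))
    print(leave_one_out_advantages(rewards))
```

For a group of n samples, the two functions above differ by a constant factor: the group-mean advantage equals (n - 1) / n times the leave-one-out advantage, so including a sample in its own baseline shrinks its advantage, most noticeably for small groups.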
