DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels

Abstract

Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs because they can generate tokens in parallel. This paradigm is particularly well suited to code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs to CUDA kernel generation remains challenging: the task demands highly specialized expertise, and high-quality training data is severely scarce. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. Building on it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage followed by an end-to-end CUDA kernel generation stage. Using this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation at three parameter scales: 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state of the art for CUDA kernel generation.
