DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels
February 12, 2026
Authors: Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang
cs.AI
Abstract
Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well suited to code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, hindered not only by the highly specialized nature of the task but also by the severe scarcity of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset curated for high-performance CUDA kernels. Building on this dataset, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales: 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state of the art for CUDA kernel generation.
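To make concrete what "CUDA kernel generation" refers to in this setting, the sketch below shows a minimal elementwise CUDA kernel with a standard host-side launch. It is purely illustrative, assuming a simple vector-add workload; it is not drawn from the paper, the CuKe dataset, or KernelBench, and the benchmark's real tasks target substantially more complex, performance-critical operators.

```cuda
// Illustrative sketch (not from the paper): a minimal CUDA kernel of the
// general kind that kernel-generation models are asked to emit.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        c[i] = a[i] + b[i];                         // one element per thread
    }
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Unified memory keeps the example short; real kernels typically manage
    // explicit device allocations and asynchronous copies.
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // cover all n elements
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);                     // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```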