DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation
June 25, 2025
Authors: Shansan Gong, Ruixiang Zhang, Huangjie Zheng, Jiatao Gu, Navdeep Jaitly, Lingpeng Kong, Yizhe Zhang
cs.AI
Abstract
Diffusion large language models (dLLMs) are compelling alternatives to
autoregressive (AR) models because their denoising models operate over the
entire sequence. The global planning and iterative refinement features of dLLMs
are particularly useful for code generation. However, current training and
inference mechanisms for dLLMs in coding are still under-explored. To demystify
the decoding behavior of dLLMs and unlock their potential for coding, we
systematically investigate their denoising processes and reinforcement learning
(RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code.
Using this model as a testbed, we analyze its decoding behavior, revealing how
it differs from that of AR models: (1) dLLMs can decide how causal their
generation should be without relying on semi-AR decoding, and (2) increasing
the sampling temperature diversifies not only token choices but also their
generation order. This diversity creates a rich search space for RL rollouts.
For RL training, to reduce the variance of token log-likelihood estimates and
maintain training efficiency, we propose coupled-GRPO, a novel
sampling scheme that constructs complementary mask noise for completions used
in training. In our experiments, coupled-GRPO significantly improves
DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and
reduces reliance on AR causality during decoding. Our work provides deeper insight
into the machinery of dLLM generation and offers an effective, diffusion-native
RL training framework. https://github.com/apple/ml-diffucoder.
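
To make point (2) concrete, below is a minimal sketch of one step of confidence-based parallel decoding for a masked dLLM, under which temperature affects not only which token is sampled at a position but also which positions get committed first. The `logits_fn` interface, the per-step commitment budget `k`, and the top-k confidence rule are illustrative assumptions, not the paper's exact decoder.

```python
import torch

def diffusion_decode_step(logits_fn, tokens, mask_id, temperature=1.0, k=1):
    """One step of confidence-based parallel decoding (illustrative sketch).

    tokens: (seq_len,) LongTensor with mask_id at still-undecided positions.
    logits_fn: callable mapping tokens -> (seq_len, vocab) logits (assumed interface).
    At higher temperature, sampled candidates and their confidences vary more,
    so *which* masked positions win the top-k commitment changes run to run --
    temperature perturbs the generation order, not only the token choices.
    """
    masked = (tokens == mask_id).nonzero(as_tuple=True)[0]
    if masked.numel() == 0:
        return tokens  # sequence fully decoded
    probs = torch.softmax(logits_fn(tokens) / temperature, dim=-1)
    # Sample a candidate token at every still-masked position.
    cand = torch.multinomial(probs[masked], num_samples=1).squeeze(-1)
    conf = probs[masked, cand]
    # Commit only the k most confident candidates; the rest stay masked.
    keep = conf.topk(min(k, masked.numel())).indices
    tokens = tokens.clone()
    tokens[masked[keep]] = cand[keep]
    return tokens
```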
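
And a minimal sketch of the complementary-mask idea behind coupled-GRPO: each sampled completion is noised twice with masks that partition its positions, so every token is scored exactly once across the pair, reducing the variance of a single random-mask Monte Carlo log-likelihood estimate at the cost of roughly two forward passes. The `per_token_logprob` interface and the fixed `mask_ratio` split are assumptions for illustration; see the linked repository for the actual implementation.

```python
import torch

def coupled_masks(completion_len, mask_ratio=0.5, generator=None):
    """Sample a mask and its complement over completion positions (sketch).

    Together the two masks cover every position exactly once, so combining the
    two passes yields a full-sequence log-likelihood estimate with lower
    variance than one independently sampled mask.
    """
    perm = torch.randperm(completion_len, generator=generator)
    cut = int(completion_len * mask_ratio)
    mask_a = torch.zeros(completion_len, dtype=torch.bool)
    mask_a[perm[:cut]] = True
    return mask_a, ~mask_a  # complementary pair

def coupled_logprob(per_token_logprob, completion, mask_a, mask_b):
    """Combine the two passes: each token is scored by the pass that masked it.

    per_token_logprob(completion, mask) -> (len,) log-probs of the true tokens
    at masked positions given the unmasked context (assumed model interface).
    """
    lp_a = per_token_logprob(completion, mask_a)
    lp_b = per_token_logprob(completion, mask_b)
    return (lp_a * mask_a).sum() + (lp_b * mask_b).sum()
```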