DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Abstract

Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement features of dLLMs are particularly useful for code generation. However, current training and inference mechanisms for dLLMs in coding remain under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior and show how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also the order in which tokens are generated. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates while maintaining training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for the completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces reliance on AR causality during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. https://github.com/apple/ml-diffucoder.
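The abstract does not spell out how the complementary mask noise is built, so the following is only an illustrative sketch of one plausible reading: a pair of masks over the same completion in which every token is masked in exactly one of the two. The names `complementary_masks`, `apply_mask`, `MASK`, and the `mask_ratio` parameter are hypothetical and are not taken from the DiffuCoder code.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, used here for illustration only


def complementary_masks(length, mask_ratio, rng):
    """Return two boolean masks over `length` positions.

    Assumption: "complementary" means the masks never overlap and
    together cover every position exactly once.
    """
    num_masked = round(length * mask_ratio)
    positions = list(range(length))
    rng.shuffle(positions)
    chosen = set(positions[:num_masked])
    mask_a = [i in chosen for i in range(length)]
    mask_b = [not m for m in mask_a]  # the exact complement of mask_a
    return mask_a, mask_b


def apply_mask(tokens, mask):
    """Replace each token whose mask entry is True with the MASK symbol."""
    return [MASK if m else tok for tok, m in zip(tokens, mask)]


# Toy completion; a real setup would operate on token ids.
completion = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
mask_a, mask_b = complementary_masks(len(completion), mask_ratio=0.3, rng=random.Random(0))
noisy_a = apply_mask(completion, mask_a)
noisy_b = apply_mask(completion, mask_b)
```

Under this assumption, scoring both noisy copies would give every token of the completion exactly one masked position at which its log-likelihood can be estimated, which is one way a paired scheme could lower the variance compared with a single random mask.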
