DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation

Abstract

Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models because their denoising models operate over the entire sequence. The global planning and iterative refinement capabilities of dLLMs are particularly useful for code generation. However, the training and inference mechanisms of dLLMs for coding remain under-explored. To demystify the decoding behavior of dLLMs and unlock their potential for coding, we systematically investigate their denoising processes and reinforcement learning (RL) methods. We train a 7B dLLM, DiffuCoder, on 130B tokens of code. Using this model as a testbed, we analyze its decoding behavior and show how it differs from that of AR models: (1) dLLMs can decide how causal their generation should be without relying on semi-AR decoding, and (2) increasing the sampling temperature diversifies not only token choices but also the order in which tokens are generated. This diversity creates a rich search space for RL rollouts. For RL training, to reduce the variance of token log-likelihood estimates while maintaining training efficiency, we propose coupled-GRPO, a novel sampling scheme that constructs complementary mask noise for the completions used in training. In our experiments, coupled-GRPO significantly improves DiffuCoder's performance on code generation benchmarks (+4.4% on EvalPlus) and reduces its reliance on AR causality during decoding. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework. Project repository: https://github.com/apple/ml-diffucoder
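
The abstract says that coupled-GRPO "constructs complementary mask noise for completions", but it does not spell out the sampling procedure. The sketch below illustrates only the general idea of a complementary mask pair: one random mask over the completion positions and its exact complement, so that every completion token is masked in exactly one of the two noisy copies. The function names, the `mask_ratio` parameter, and the `[MASK]` placeholder are hypothetical and are not taken from the DiffuCoder code base.

```python
import random


def complementary_masks(completion_len, mask_ratio=0.5, seed=None):
    """Return two boolean masks over completion positions that are exact complements.

    Illustrative assumption: a position is True when the token is masked.
    """
    rng = random.Random(seed)
    positions = list(range(completion_len))
    rng.shuffle(positions)
    n_masked = round(mask_ratio * completion_len)
    masked_in_first = set(positions[:n_masked])
    mask_a = [i in masked_in_first for i in range(completion_len)]
    # The second mask hides exactly the positions the first one leaves visible.
    mask_b = [not m for m in mask_a]
    return mask_a, mask_b


def apply_mask(tokens, mask, mask_token="[MASK]"):
    """Replace masked positions with a placeholder token."""
    return [mask_token if m else t for t, m in zip(tokens, mask)]


tokens = ["def", "add", "(", "a", ",", "b", ")", ":"]
mask_a, mask_b = complementary_masks(len(tokens), mask_ratio=0.5, seed=0)
noisy_a = apply_mask(tokens, mask_a)
noisy_b = apply_mask(tokens, mask_b)
```

Under this assumption, scoring both noisy copies yields a prediction for every completion token exactly once per pair, which is one plausible reading of how complementary masks could reduce the variance of per-token log-likelihood estimates compared with a single random mask.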

