Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model
January 22, 2026
Authors: Chenghao Fan, Wen Heng, Bo Li, Sichen Liu, Yuxuan Song, Jing Su, Xiaoye Qu, Kai Shen, Wei Wei
cs.AI
Abstract
Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and a block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder outperforms its AR counterpart overall on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B AR models and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. Notably, diffusion-based any-order modeling improves structured code modeling for editing and reasoning and, through data augmentation, benefits low-resource programming languages.
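To make the "block-wise clipped noise schedule" mentioned above concrete, the sketch below shows one plausible form of such a corruption step for block diffusion training: each block of the sequence receives its own noise level, clipped to a bounded range so that no block is nearly clean or fully masked. This is a minimal illustration, not the authors' implementation; the block size, clipping bounds, mask token id, and function names are assumptions.

```python
# Illustrative sketch (assumed, not from the paper) of block-wise, clipped
# noise-level sampling with mask-based corruption for block diffusion training.
import torch

MASK_ID = 0               # hypothetical mask token id
BLOCK_SIZE = 32           # hypothetical block length
T_MIN, T_MAX = 0.2, 0.8   # hypothetical clipping bounds on the noise level


def corrupt_blocks(tokens: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Mask each block independently at a clipped noise level.

    tokens: (batch, seq_len) integer ids; seq_len is assumed divisible by BLOCK_SIZE.
    Returns the corrupted ids and the boolean mask marking corrupted positions.
    """
    batch, seq_len = tokens.shape
    num_blocks = seq_len // BLOCK_SIZE
    # One noise level per block, clipped so extreme levels never occur.
    t = torch.rand(batch, num_blocks).clamp_(T_MIN, T_MAX)
    # Broadcast the per-block level to every position inside its block.
    t_per_token = t.repeat_interleave(BLOCK_SIZE, dim=1)
    # Mask each position independently with probability equal to its block's noise level.
    mask = torch.rand(batch, seq_len) < t_per_token
    corrupted = tokens.masked_fill(mask, MASK_ID)
    return corrupted, mask


if __name__ == "__main__":
    x = torch.randint(1, 1000, (2, 128))  # toy token ids
    noisy, mask = corrupt_blocks(x)
    print(noisy.shape, mask.float().mean().item())
```

In this reading, clipping the per-block noise level bounds the fraction of masked tokens per block, which is one way a schedule could stabilize continual pretraining as the abstract describes; the actual schedule used by Stable-DiffCoder may differ.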