
Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

January 22, 2026
Authors: Chenghao Fan, Wen Heng, Bo Li, Sichen Liu, Yuxuan Song, Jing Su, Xiaoye Qu, Kai Shen, Wei Wei
cs.AI

Abstract

Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse compared to autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable compute budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and a block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder outperforms its AR counterpart overall on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B AR models and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. In addition, diffusion-based any-order modeling improves structured code editing and reasoning, and, through data augmentation, benefits low-resource programming languages.
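To make the "block-wise clipped noise schedule" idea concrete, below is a minimal sketch of how block-wise masking with a clipped per-block noise level could look in a mask-based block diffusion training loop. This is an illustrative assumption, not the paper's implementation: the function names (`sample_block_noise_levels`, `block_diffusion_masking`), the clipping range `[0.2, 0.8]`, and the uniform noise sampling are all hypothetical choices made for the sketch.

```python
import torch


def sample_block_noise_levels(num_blocks: int, t_min: float = 0.2, t_max: float = 0.8) -> torch.Tensor:
    """Hypothetical block-wise clipped noise schedule: each block draws its own
    mask ratio, clipped to [t_min, t_max] so no block is fully clean or fully
    noised. The actual range used by Stable-DiffCoder is not stated in the abstract."""
    t = torch.rand(num_blocks)
    return t.clamp(t_min, t_max)


def block_diffusion_masking(tokens: torch.Tensor, block_size: int, mask_id: int) -> tuple[torch.Tensor, torch.Tensor]:
    """Corrupt a token sequence (shape [seq_len]) block by block.

    Each block is masked independently at its sampled noise level; a block
    diffusion model is then trained to reconstruct the masked positions,
    conditioning on earlier blocks and the unmasked tokens within the block."""
    seq_len = tokens.shape[0]
    num_blocks = (seq_len + block_size - 1) // block_size
    noise_levels = sample_block_noise_levels(num_blocks)

    corrupted = tokens.clone()
    mask = torch.zeros(seq_len, dtype=torch.bool)
    for b in range(num_blocks):
        start, end = b * block_size, min((b + 1) * block_size, seq_len)
        block_mask = torch.rand(end - start) < noise_levels[b]
        mask[start:end] = block_mask
        corrupted[start:end][block_mask] = mask_id  # replace masked tokens with [MASK]
    return corrupted, mask
```

In such a setup, the training loss would typically be a cross-entropy over only the masked positions returned in `mask`, which is what allows the model to learn any-order, block-wise reconstruction rather than strictly left-to-right prediction.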