Stable-DiffCoder: Pushing the Frontier of Code Diffusion Large Language Model

Abstract

Diffusion-based language models (DLLMs) offer non-sequential, block-wise generation and richer data reuse than autoregressive (AR) models, but existing code DLLMs still lag behind strong AR baselines under comparable budgets. We revisit this setting in a controlled study and introduce Stable-DiffCoder, a block diffusion code model that reuses the Seed-Coder architecture, data, and training pipeline. To enable efficient knowledge learning and stable training, we incorporate a block diffusion continual pretraining (CPT) stage enhanced by a tailored warmup and a block-wise clipped noise schedule. Under the same data and architecture, Stable-DiffCoder outperforms its AR counterpart overall on a broad suite of code benchmarks. Moreover, relying only on the CPT and supervised fine-tuning stages, Stable-DiffCoder achieves stronger performance than a wide range of ~8B AR models and DLLMs, demonstrating that diffusion-based training can improve code modeling quality beyond AR training alone. In addition, diffusion-based any-order modeling improves structured code modeling for editing and reasoning and, through data augmentation, benefits low-resource programming languages.

