CoDA: Coding LM via Diffusion Adaptation
September 27, 2025
Authors: Haolin Chen, Shiyu Wang, Can Qin, Bo Pang, Zuxin Liu, Jielin Qiu, Jianguo Zhang, Yingbo Zhou, Zeyuan Chen, Ran Xu, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao
cs.AI
Abstract
Diffusion language models promise bidirectional context and infilling capabilities that autoregressive coders lack, yet practical systems remain heavyweight. We introduce CoDA, a 1.7B-parameter diffusion coder trained on TPU with a fully open-source training pipeline. CoDA pairs large-scale diffusion pre-training with code-centric mid-training and instruction tuning, enabling confidence-guided sampling that keeps inference latency competitive. On HumanEval, MBPP, and EvalPlus, CoDA-1.7B-Instruct matches or surpasses diffusion models up to 7B parameters. Our release includes model checkpoints, evaluation harnesses, and TPU training pipelines to accelerate research on lightweight diffusion-based coding assistants.
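The abstract names confidence-guided sampling but does not spell out the procedure, so the following is only a minimal PyTorch sketch of the generic MaskGIT/LLaDA-style iterative unmasking such samplers typically use, not CoDA's released implementation. The `model` interface (HF-style output with `.logits`), `mask_id`, and the per-step commit schedule are assumptions for illustration.

```python
import torch

def confidence_guided_sample(model, prompt_ids, gen_len, mask_id, num_steps=8):
    """Sketch: iteratively unmask a completion region, committing the most
    confident token predictions each step and re-scoring the rest."""
    device = prompt_ids.device
    prompt_len = prompt_ids.shape[0]

    # Start from the prompt followed by a fully masked completion region.
    completion = torch.full((gen_len,), mask_id,
                            dtype=prompt_ids.dtype, device=device)
    x = torch.cat([prompt_ids, completion])

    for step in range(num_steps):
        still_masked = x[prompt_len:] == mask_id
        if not still_masked.any():
            break

        # One bidirectional forward pass scores every position in parallel
        # (assumes an HF-style model whose output exposes `.logits`).
        logits = model(x.unsqueeze(0)).logits[0, prompt_len:]  # (gen_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)

        # Only still-masked positions compete for commitment this step.
        conf = torch.where(still_masked, conf, torch.full_like(conf, -1.0))

        # Spread the remaining masks evenly over the steps left, committing
        # the positions the model is most confident about (assumed schedule).
        k = max(1, still_masked.sum().item() // (num_steps - step))
        commit = torch.topk(conf, k).indices
        x[prompt_len + commit] = pred[commit]

    return x
```

Because each step fills many positions at once, the number of forward passes is fixed by `num_steps` rather than by sequence length, which is what lets this style of sampler keep inference latency competitive with autoregressive decoding.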