LLaDA2.0: Scaling Up Diffusion Language Models to 100B
December 10, 2025
Authors: Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang
cs.AI
Abstract
This paper presents LLaDA2.0 -- a family of discrete diffusion large language models (dLLMs) scaling up to 100B total parameters, built through systematic conversion from auto-regressive (AR) models -- establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 upholds the design principles of knowledge inheritance, progressive adaptation, and efficiency awareness, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase, block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Together with post-training alignment via SFT and DPO, we obtain LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
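To make the three-phase warm-up/stable/decay idea concrete, the minimal Python sketch below illustrates one way a block-size schedule of that shape could be expressed. The step budgets, block sizes, and the `block_size_schedule` helper are hypothetical illustrations chosen for readability; the abstract does not specify these values.

```python
# Illustrative sketch only: concrete block sizes and step budgets are assumptions,
# not values from the paper. It shows the shape of a three-phase block-level
# WSD-style schedule: warm-up -> stable -> decay.

def block_size_schedule(step: int,
                        warmup_steps: int = 10_000,
                        stable_steps: int = 100_000,
                        warmup_block_sizes=(4, 8, 16, 32),
                        full_sequence_len: int = 4096,
                        compact_block_size: int = 32) -> int:
    """Return the diffusion block size used at a given training step."""
    if step < warmup_steps:
        # Warm-up: progressively enlarge the block size in block diffusion,
        # easing the pre-trained AR model toward diffusion-style denoising.
        stage = step * len(warmup_block_sizes) // warmup_steps
        return warmup_block_sizes[stage]
    if step < warmup_steps + stable_steps:
        # Stable: large-scale full-sequence diffusion
        # (the block spans the whole sequence).
        return full_sequence_len
    # Decay: revert to a compact block size, matching the block-diffusion
    # setting used for efficient parallel decoding at inference time.
    return compact_block_size
```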