LLaDA2.0: Scaling Up Diffusion Language Models to 100B

December 10, 2025
作者: Tiwei Bie, Maosong Cao, Kun Chen, Lun Du, Mingliang Gong, Zhuochen Gong, Yanmei Gu, Jiaqi Hu, Zenan Huang, Zhenzhong Lan, Chengxi Li, Chongxuan Li, Jianguo Li, Zehuan Li, Huabin Liu, Ling Liu, Guoshan Lu, Xiaocheng Lu, Yuxin Ma, Jianfeng Tan, Lanning Wei, Ji-Rong Wen, Yipeng Xing, Xiaolu Zhang, Junbo Zhao, Da Zheng, Jun Zhou, Junlin Zhou, Zhanchao Zhou, Liwang Zhu, Yihong Zhuang
cs.AI

Abstract

This paper presents LLaDA2.0, a family of discrete diffusion large language models (dLLMs) scaled up to 100B total parameters through systematic conversion from auto-regressive (AR) models, establishing a new paradigm for frontier-scale deployment. Instead of costly training from scratch, LLaDA2.0 follows the design principles of knowledge inheritance, progressive adaptation, and efficiency awareness, and seamlessly converts a pre-trained AR model into a dLLM with a novel three-phase, block-level WSD-based training scheme: progressively increasing the block size in block diffusion (warm-up), large-scale full-sequence diffusion (stable), and reverting to compact-size block diffusion (decay). Together with post-training alignment via SFT and DPO, this yields LLaDA2.0-mini (16B) and LLaDA2.0-flash (100B), two instruction-tuned Mixture-of-Experts (MoE) variants optimized for practical deployment. By preserving the advantages of parallel decoding, these models deliver superior performance and efficiency at the frontier scale. Both models are open-sourced.
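
For intuition, below is a minimal sketch of the warm-up/stable/decay block-size schedule described in the abstract. The function name, step counts, and block sizes are illustrative assumptions for exposition only; they are not values reported by the paper.

```python
# Hypothetical sketch of the 3-phase block-size schedule (warm-up -> stable -> decay).
# All constants below are assumptions, not settings from LLaDA2.0.

def block_size_schedule(step: int,
                        warmup_steps: int = 10_000,
                        stable_steps: int = 100_000,
                        warmup_sizes: tuple = (4, 8, 16, 32),
                        full_sequence_len: int = 4096,
                        compact_size: int = 32) -> int:
    """Return the diffusion block size to use at a given training step."""
    if step < warmup_steps:
        # Warm-up: progressively enlarge the block size used in block diffusion.
        idx = min(step * len(warmup_sizes) // warmup_steps, len(warmup_sizes) - 1)
        return warmup_sizes[idx]
    if step < warmup_steps + stable_steps:
        # Stable: large-scale full-sequence diffusion (the block spans the whole sequence).
        return full_sequence_len
    # Decay: revert to a compact block size, which supports efficient parallel decoding.
    return compact_size

# Example: block size early in warm-up vs. during the decay phase.
print(block_size_schedule(1_000), block_size_schedule(150_000))
```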