LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
April 22, 2026
Authors: Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, Jianguo Li, Tao Lin, Qi Qin, Hongjun Wang, Xiaomei Wang, Haoyuan Wu, Yi Xin, Junbo Zhao
cs.AI
Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized vision-language models (VLMs) in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
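To make the block-level masked-diffusion decoding mentioned above concrete, here is a minimal toy sketch: a block of discrete tokens starts fully masked, and at each diffusion step the sampler commits the model's highest-confidence proposals while re-masking the rest. The `MASK` id, the linear unmasking schedule, and the confidence-based selection are illustrative assumptions for exposition, not the paper's exact sampler; a stand-in function replaces the dLLM backbone.

```python
import random

MASK = -1  # hypothetical mask-token id (illustrative)

def denoise_block(block_len, steps, predict):
    """Toy block-level masked-diffusion sampler: the block starts fully
    masked, and a growing fraction of positions is committed each step,
    keeping the model's highest-confidence proposals first."""
    tokens = [MASK] * block_len
    for step in range(1, steps + 1):
        # The model proposes (token, confidence) for every masked position.
        proposals = {i: predict(tokens, i)
                     for i, t in enumerate(tokens) if t == MASK}
        if not proposals:
            break
        # Commit just enough positions to hit this step's unmasking target.
        target = round(block_len * step / steps)
        n_unmask = target - sum(t != MASK for t in tokens)
        best = sorted(proposals.items(),
                      key=lambda kv: kv[1][1], reverse=True)[:max(n_unmask, 0)]
        for i, (tok, _conf) in best:
            tokens[i] = tok
    return tokens

# Stand-in "backbone": a deterministic token per position, random confidence.
rng = random.Random(0)
def toy_predict(tokens, i):
    return (i % 100, rng.random())

block = denoise_block(block_len=16, steps=4, predict=toy_predict)
assert all(t != MASK for t in block)  # block fully denoised in 4 steps
```

Because every position is predicted in parallel at each step, the number of model calls scales with the step count rather than the block length, which is what makes this family of samplers faster than token-by-token autoregressive decoding.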