LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
April 22, 2026
作者: Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, Jianguo Li, Tao Lin, Qi Qin, Hongjun Wang, Xiaomei Wang, Haoyuan Wu, Yi Xin, Junbo Zhao
cs.AI
Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and vision tokens in the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is pushed beyond plain parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized vision-language models (VLMs) in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
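To make the "block-level masked diffusion" idea concrete, the toy sketch below decodes a discrete token sequence block by block: within each block, all positions start masked and are revealed over a few parallel refinement steps, highest-confidence predictions first. This is an illustrative sketch only; the `toy_predictor` stand-in, the `MASK` sentinel, and all function names are invented here and are not the paper's actual model or API.

```python
import random

MASK = -1  # hypothetical mask-token id (illustrative only)

def toy_predictor(tokens):
    """Stand-in for the dLLM backbone: returns a (token, confidence)
    guess for every masked position. The guessed token is just i % 7
    so the example runs deterministically without a real model."""
    return {i: (i % 7, random.random())
            for i, t in enumerate(tokens) if t == MASK}

def block_masked_diffusion_decode(length, block_size=4, steps_per_block=2):
    """Toy block-level masked diffusion sampler: blocks are decoded
    left to right; within a block, the most confident masked positions
    are unmasked over a few parallel refinement steps."""
    tokens = [MASK] * length
    for start in range(0, length, block_size):
        block = list(range(start, min(start + block_size, length)))
        for step in range(steps_per_block):
            preds = toy_predictor(tokens)
            masked = [i for i in block if tokens[i] == MASK]
            if not masked:
                break
            # unmask the most confident fraction this step (parallel decoding)
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            remaining_steps = steps_per_block - step
            k = max(1, len(masked) // remaining_steps)
            for i in masked[:k]:
                tokens[i] = preds[i][0]
    return tokens

print(block_masked_diffusion_decode(10))
```

Because each block is fully committed before the next begins, earlier blocks act as a fixed prefix for later ones, which is what makes prefix-aware (cache-style) optimizations applicable despite the non-autoregressive decoding inside each block.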