LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
April 22, 2026
作者: Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, Jianguo Li, Tao Lin, Qi Qin, Hongjun Wang, Xiaomei Wang, Haoyuan Wu, Yi Xin, Junbo Zhao
cs.AI
Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and vision tokens in the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is pushed beyond plain parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized vision-language models (VLMs) in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
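To make the "block-level masked diffusion" idea concrete, the toy sketch below decodes a discrete token sequence block by block: within each block, all positions start masked and are revealed over a few parallel refinement steps, highest-confidence predictions first. This is an illustrative sketch only; the `toy_predictor` stand-in, the `MASK` sentinel, and all function names are invented here and are not the paper's actual model or API.

```python
import random

MASK = -1  # hypothetical mask-token id (illustrative only)

def toy_predictor(tokens):
    """Stand-in for the dLLM backbone: returns a (token, confidence)
    guess for every masked position. The guessed token is just i % 7
    so the example runs deterministically without a real model."""
    return {i: (i % 7, random.random())
            for i, t in enumerate(tokens) if t == MASK}

def block_masked_diffusion_decode(length, block_size=4, steps_per_block=2):
    """Toy block-level masked diffusion sampler: blocks are decoded
    left to right; within a block, the most confident masked positions
    are unmasked over a few parallel refinement steps."""
    tokens = [MASK] * length
    for start in range(0, length, block_size):
        block = list(range(start, min(start + block_size, length)))
        for step in range(steps_per_block):
            preds = toy_predictor(tokens)
            masked = [i for i in block if tokens[i] == MASK]
            if not masked:
                break
            # unmask the most confident fraction this step (parallel decoding)
            masked.sort(key=lambda i: preds[i][1], reverse=True)
            remaining_steps = steps_per_block - step
            k = max(1, len(masked) // remaining_steps)
            for i in masked[:k]:
                tokens[i] = preds[i][0]
    return tokens

print(block_masked_diffusion_decode(10))
```

Because each block is fully committed before the next begins, earlier blocks act as a fixed prefix for later ones, which is what makes prefix-aware (cache-style) optimizations applicable despite the non-autoregressive decoding inside each block.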