LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
April 22, 2026
Authors: Inclusion AI, Tiwei Bie, Haoxing Chen, Tieyuan Chen, Zhenglin Cheng, Long Cui, Kai Gan, Zhicheng Huang, Zhenzhong Lan, Haoquan Li, Jianguo Li, Tao Lin, Qi Qin, Hongjun Wang, Xiaomei Wang, Haoyuan Wu, Yi Xin, Junbo Zhao
cs.AI
Abstract
We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and vision inputs within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Inference efficiency is enhanced beyond parallel decoding through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized vision-language models (VLMs) in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
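To make the block-level masked-diffusion decoding mentioned above concrete, here is a minimal toy sketch: a block of discrete tokens starts fully masked, and at each diffusion step the sampler commits the model's highest-confidence proposals while re-masking the rest. The `MASK` id, the linear unmasking schedule, and the confidence-based selection are illustrative assumptions for exposition, not the paper's exact sampler; a stand-in function replaces the dLLM backbone.

```python
import random

MASK = -1  # hypothetical mask-token id (illustrative)

def denoise_block(block_len, steps, predict):
    """Toy block-level masked-diffusion sampler: the block starts fully
    masked, and a growing fraction of positions is committed each step,
    keeping the model's highest-confidence proposals first."""
    tokens = [MASK] * block_len
    for step in range(1, steps + 1):
        # The model proposes (token, confidence) for every masked position.
        proposals = {i: predict(tokens, i)
                     for i, t in enumerate(tokens) if t == MASK}
        if not proposals:
            break
        # Commit just enough positions to hit this step's unmasking target.
        target = round(block_len * step / steps)
        n_unmask = target - sum(t != MASK for t in tokens)
        best = sorted(proposals.items(),
                      key=lambda kv: kv[1][1], reverse=True)[:max(n_unmask, 0)]
        for i, (tok, _conf) in best:
            tokens[i] = tok
    return tokens

# Stand-in "backbone": a deterministic token per position, random confidence.
rng = random.Random(0)
def toy_predict(tokens, i):
    return (i % 100, rng.random())

block = denoise_block(block_len=16, steps=4, predict=toy_predict)
assert all(t != MASK for t in block)  # block fully denoised in 4 steps
```

Because every position is predicted in parallel at each step, the number of model calls scales with the step count rather than the block length, which is what makes this family of samplers faster than token-by-token autoregressive decoding.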