LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with a Diffusion Large Language Model

Abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, a MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and visual tokens within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Beyond the speedup from parallel decoding, inference efficiency is further improved through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
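
To make the two mechanisms named in the abstract more concrete, the following is a minimal sketch of (1) discretizing continuous features by nearest-codebook lookup and (2) block-level masked decoding. It is not the released implementation. Every name here (`quantize`, `toy_denoiser`, `block_masked_decode`, `MASK`) is hypothetical, the denoiser is a random stand-in for the dLLM backbone, and the confidence-ordered unmasking schedule is an assumption that the abstract does not confirm. Prefix-aware optimizations and the diffusion decoder are not modeled.

```python
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption, not from the paper)


def quantize(features, codebook):
    """Map each continuous feature vector to the index of its nearest codebook entry."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
            for f in features]


def toy_denoiser(tokens, vocab_size, rng):
    """Random stand-in for the backbone: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def block_masked_decode(prefix, num_new, block_size, steps_per_block,
                        vocab_size, seed=0):
    """Fill num_new masked positions block by block, left to right.

    Within a block, each step unmasks the most confident masked positions
    (assumed schedule); the final step unmasks whatever is still masked.
    """
    rng = random.Random(seed)
    tokens = list(prefix) + [MASK] * num_new
    for start in range(len(prefix), len(tokens), block_size):
        end = min(start + block_size, len(tokens))
        for step in range(steps_per_block):
            masked = [i for i in range(start, end) if tokens[i] == MASK]
            if not masked:
                break
            guesses = toy_denoiser(tokens, vocab_size, rng)
            remaining_steps = steps_per_block - step
            k = -(-len(masked) // remaining_steps)  # ceiling division
            best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:k]
            for i in best:  # several positions are committed in parallel
                tokens[i] = guesses[i][0]
    return tokens


# Two toy "visual features" snapped to a two-entry codebook.
codes = quantize([[0.1, 0.0], [0.9, 1.1]], [[0.0, 0.0], [1.0, 1.0]])
print(codes)  # [0, 1]

# Generate 10 new tokens after a 2-token prefix, 4 tokens per block.
out = block_masked_decode(prefix=[5, 6], num_new=10, block_size=4,
                          steps_per_block=2, vocab_size=16)
print(len(out), MASK in out)  # 12 False
```

In this sketch each block takes at most `steps_per_block` denoising passes however many tokens it holds, which is where the parallel-decoding speedup over one-token-at-a-time generation comes from.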