LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with a Diffusion Large Language Model

Abstract

We present LLaDA2.0-Uni, a unified discrete diffusion large language model (dLLM) that supports multimodal understanding and generation within a natively integrated framework. Its architecture combines a fully semantic discrete tokenizer, an MoE-based dLLM backbone, and a diffusion decoder. By discretizing continuous visual inputs via SigLIP-VQ, the model enables block-level masked diffusion over both text and visual tokens within the backbone, while the decoder reconstructs visual tokens into high-fidelity images. Beyond parallel decoding, inference efficiency is further improved through prefix-aware optimizations in the backbone and few-step distillation in the decoder. Supported by carefully curated large-scale data and a tailored multi-stage training pipeline, LLaDA2.0-Uni matches specialized VLMs in multimodal understanding while delivering strong performance in image generation and editing. Its native support for interleaved generation and reasoning establishes a promising and scalable paradigm for next-generation unified foundation models. Code and models are available at https://github.com/inclusionAI/LLaDA2.0-Uni.
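
To make the two core ideas concrete, the following is a minimal toy sketch, not the released implementation. It assumes that discretization can be illustrated as a nearest-neighbor codebook lookup and that block-level masked diffusion corrupts only the tokens of the current block while the preceding prefix stays clean. The names `quantize`, `mask_block`, and `MASK`, the tiny codebook, and the fixed mask ratio are all hypothetical choices made for illustration; the abstract does not specify any of them.

```python
import random

MASK = "<mask>"  # hypothetical mask token, chosen for this sketch only


def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec.

    Uses squared Euclidean distance. This is generic vector quantization,
    assumed here as a stand-in for the SigLIP-VQ tokenizer.
    """
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(vec, codebook[i]))

    return min(range(len(codebook)), key=sq_dist)


def mask_block(tokens, block_size, block_index, mask_ratio, seed=0):
    """Corrupt one block of a token sequence with mask tokens.

    Tokens before the block (the prefix) are returned unchanged, tokens
    inside the block are masked at the given ratio, and tokens after the
    block are dropped because they have not been generated yet.
    """
    start = block_index * block_size
    if not 0 <= start < len(tokens):
        raise ValueError("block_index is out of range")
    end = min(start + block_size, len(tokens))

    positions = list(range(start, end))
    n_mask = round(mask_ratio * len(positions))
    masked = set(random.Random(seed).sample(positions, n_mask))

    return [MASK if i in masked else tok
            for i, tok in enumerate(tokens[:end])]


# Toy 2-D "visual features" and a 3-entry codebook (made-up values).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
features = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8), (1.1, 0.9)]
visual_ids = [quantize(f, codebook) for f in features]

# Text and visual tokens share one discrete sequence.
sequence = ["a", "cat"] + [f"<img_{i}>" for i in visual_ids]

# Fully mask the second block (positions 2 and 3); the text prefix is kept.
noised = mask_block(sequence, block_size=2, block_index=1, mask_ratio=1.0)
```

In this sketch a denoiser would be asked to fill in the masked positions of `noised` given the clean prefix, and all masked positions of a block could be predicted in parallel. That is the property the abstract refers to when it mentions parallel decoding.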