LLaDA-o: An Effective and Length-Adaptive Omni Diffusion Model
March 1, 2026
Authors: Zebin You, Xiaolu Zhang, Jun Zhou, Chongxuan Li, Ji-Rong Wen
cs.AI
Abstract
We present LLaDA-o, an effective and length-adaptive omni diffusion model for multimodal understanding and generation. LLaDA-o is built on a Mixture of Diffusion (MoD) framework that decouples discrete masked diffusion for text understanding from continuous diffusion for visual generation, while coupling the two through a shared, simple, and efficient attention backbone that reduces redundant computation over fixed conditions. Building on MoD, we further introduce a data-centric length-adaptation strategy that enables flexible-length decoding in multimodal settings without architectural changes. Extensive experiments show that LLaDA-o achieves state-of-the-art performance among omni diffusion models on multimodal understanding and generation benchmarks, reaching 87.04 on DPG-Bench for text-to-image generation and demonstrating the effectiveness of unified omni diffusion modeling. Code is available at https://github.com/ML-GSAI/LLaDA-o.
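To make the MoD idea concrete, the following is a minimal, hypothetical sketch (not the paper's implementation) of how a single shared backbone can serve two diffusion losses: a discrete masked-token objective for text and a continuous noise-regression objective for image latents. The backbone, embedding table, mask ratio, and all shapes are illustrative assumptions; the real model uses an attention network rather than the toy linear map used here.

```python
import numpy as np

# Toy Mixture-of-Diffusion (MoD) training step: discrete masked diffusion for
# text and continuous diffusion for image latents share one backbone.
# All names and hyperparameters below are illustrative assumptions.

rng = np.random.default_rng(0)
VOCAB, DIM, MASK_ID = 100, 16, 99          # MASK_ID is a reserved mask token

W = rng.normal(size=(DIM, DIM)) * 0.1      # shared "backbone" weights (stand-in)
E = rng.normal(size=(VOCAB, DIM)) * 0.1    # token embedding table

def backbone(h):
    """Stand-in for the shared attention backbone: one nonlinear layer."""
    return np.tanh(h @ W)

def text_masked_diffusion_loss(tokens, mask_ratio=0.5):
    """Discrete masked diffusion: mask tokens, predict the originals."""
    mask = rng.random(len(tokens)) < mask_ratio
    corrupted = np.where(mask, MASK_ID, tokens)
    h = backbone(E[corrupted])
    logits = h @ E.T                       # scores over the vocabulary
    p = np.exp(logits - logits.max(-1, keepdims=True))
    p /= p.sum(-1, keepdims=True)
    # Cross-entropy only on the masked positions, as in masked diffusion.
    return -np.log(p[mask, tokens[mask]] + 1e-9).mean()

def image_continuous_diffusion_loss(latents, t=0.5):
    """Continuous diffusion: add Gaussian noise, regress it (eps-prediction)."""
    eps = rng.normal(size=latents.shape)
    noisy = np.sqrt(1.0 - t) * latents + np.sqrt(t) * eps
    eps_hat = backbone(noisy)              # same shared backbone as for text
    return ((eps_hat - eps) ** 2).mean()

tokens = rng.integers(0, VOCAB - 1, size=32)
latents = rng.normal(size=(32, DIM))
loss = text_masked_diffusion_loss(tokens) + image_continuous_diffusion_loss(latents)
print(float(loss))
```

In a real system both losses would backpropagate into the shared backbone, which is what couples the two modalities while each keeps its own diffusion process.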