

Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion

March 6, 2026
Authors: Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, Chaoyou Fu
cs.AI

Abstract

While recent multimodal large language models (MLLMs) have made impressive strides, they predominantly employ a conventional autoregressive architecture as their backbone, leaving significant room to explore effective and efficient alternatives in architectural design. Concurrently, recent studies have successfully applied discrete diffusion models to various domains, such as visual understanding and image generation, revealing their considerable potential as a promising backbone for multimodal systems. Drawing inspiration from this pioneering research, we introduce Omni-Diffusion, the first any-to-any multimodal language model built entirely on mask-based discrete diffusion, unifying understanding and generation across text, speech, and images. Omni-Diffusion employs a unified mask-based discrete diffusion model to directly capture the joint distribution over discrete multimodal tokens. This approach supports not only bimodal tasks but also more complex scenarios involving multiple modalities. On a diverse set of benchmarks, our method outperforms or performs on par with existing multimodal systems that process two or more modalities, highlighting the significant promise of diffusion models in powering the next generation of multimodal foundation models. Project webpage: https://omni-diffusion.github.io.
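As a rough illustration of the masked discrete diffusion objective described in the abstract, the sketch below shows one plausible training step: discrete tokens from all modalities are concatenated into a single sequence, a random fraction of positions is replaced with an absorbing [MASK] id, and a denoiser predicts the original tokens at the masked positions. The vocabulary size, MASK_ID, and model interface here are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a masked (absorbing-state) discrete diffusion training step
# over a joint multimodal token sequence. Names and sizes are assumptions for
# illustration only, not the paper's actual implementation.
import torch
import torch.nn.functional as F

VOCAB_SIZE = 65536       # assumed shared vocabulary covering text/speech/image tokens
MASK_ID = VOCAB_SIZE     # dedicated absorbing [MASK] token id

def masked_diffusion_loss(model, tokens):
    """tokens: (batch, seq_len) discrete ids from all modalities, already interleaved."""
    b, n = tokens.shape
    # Sample a corruption level t ~ U(0, 1) per sequence and mask that fraction of positions.
    t = torch.rand(b, 1, device=tokens.device)
    mask = torch.rand(b, n, device=tokens.device) < t
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    # The denoiser predicts the original token at every position in parallel.
    logits = model(corrupted)                     # (b, n, VOCAB_SIZE + 1)
    # Cross-entropy only on masked positions: the standard absorbing-state objective.
    return F.cross_entropy(logits[mask], tokens[mask], reduction="mean")
```

Because the loss is defined over whichever positions happen to be masked, one reading of the abstract is that the same objective can condition on any subset of modalities and denoise the rest, which is a natural fit for the any-to-any formulation.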