Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

October 7, 2025
Authors: Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu
cs.AI

Abstract

We introduce Lumina-DiMOO, an open-source foundation model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This approach allows Lumina-DiMOO to achieve higher sampling efficiency than previous autoregressive (AR) or hybrid AR-diffusion paradigms and to support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
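The abstract contrasts discrete diffusion sampling with AR decoding but does not spell out the mechanics. As a rough illustration only, the sketch below shows a MaskGIT-style parallel unmasking loop, one common instantiation of fully discrete diffusion: decoding starts from an all-mask token grid and commits the highest-confidence predictions over a handful of steps, rather than emitting one token per step as AR decoding does. Everything here (the `ToyDenoiser`, vocabulary size, schedule, and shapes) is a hypothetical stand-in, not Lumina-DiMOO's actual architecture or sampler.

```python
# Hypothetical sketch of discrete-diffusion sampling over a token grid.
# Nothing below is taken from the Lumina-DiMOO codebase; the model,
# schedule, and shapes are illustrative assumptions only.
import torch

VOCAB_SIZE = 8192      # assumed codebook size of an image tokenizer
MASK_ID = VOCAB_SIZE   # extra "mask" token id
SEQ_LEN = 256          # e.g., a 16x16 grid of image tokens
NUM_STEPS = 8          # few denoising steps vs. SEQ_LEN AR steps


class ToyDenoiser(torch.nn.Module):
    """Stand-in for the transformer that predicts per-position token logits."""

    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB_SIZE + 1, 64)  # +1 for the mask token
        self.head = torch.nn.Linear(64, VOCAB_SIZE)

    def forward(self, tokens):
        return self.head(self.emb(tokens))  # (B, L, VOCAB_SIZE)


@torch.no_grad()
def sample(model, steps=NUM_STEPS):
    # Start fully masked; unmask a growing fraction of positions each step.
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(steps):
        probs = model(tokens).softmax(-1)
        pred = probs.argmax(-1)                    # greedy fill-in (real samplers add noise)
        conf = probs.max(-1).values
        conf[tokens != MASK_ID] = float("inf")     # already-committed tokens always survive
        k = int((step + 1) / steps * SEQ_LEN)      # linear schedule: tokens kept after this step
        keep = conf.topk(k, dim=-1).indices
        new_tokens = torch.full_like(tokens, MASK_ID)
        new_tokens.scatter_(1, keep, pred.gather(1, keep))
        new_tokens[tokens != MASK_ID] = tokens[tokens != MASK_ID]  # restore committed tokens
        tokens = new_tokens
    return tokens


print(sample(ToyDenoiser()).shape)  # torch.Size([1, 256])
```

Under these assumptions, the efficiency claim follows from the loop's shape: the number of network forward passes scales with the step count (8 here) rather than with the sequence length (256 here), which is where AR decoding's cost comes from.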