Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

October 7, 2025
作者: Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu
cs.AI

Abstract

We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This approach allows Lumina-DiMOO to achieve higher sampling efficiency than previous autoregressive (AR) or hybrid AR-diffusion paradigms and to flexibly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advances in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
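
To make the abstract's central idea concrete, here is a minimal sketch of the mask-based discrete diffusion sampling loop that this family of models builds on: generation starts from a fully masked token sequence and reveals the most confident predictions in parallel over a few denoising steps, rather than decoding one token at a time as AR models do. Everything below (the toy denoiser, vocabulary size, and cosine unmasking schedule) is an illustrative assumption in the spirit of MaskGIT-style samplers, not Lumina-DiMOO's actual implementation.

```python
# Illustrative sketch of mask-based discrete diffusion sampling.
# All names and values are hypothetical, not the project's real API.
import math
import torch

VOCAB_SIZE = 8192      # hypothetical codebook size for discretized image tokens
MASK_ID = VOCAB_SIZE   # reserved [MASK] token id, outside the codebook range
SEQ_LEN = 256          # e.g., a 16x16 grid of image tokens
NUM_STEPS = 16         # a handful of parallel steps vs. SEQ_LEN AR passes

def denoiser(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the trained network: returns logits over the codebook
    for every position. A real model would condition on the text prompt."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB_SIZE)

@torch.no_grad()
def sample(batch_size: int = 1) -> torch.Tensor:
    # Start from a fully masked sequence.
    tokens = torch.full((batch_size, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(NUM_STEPS):
        probs = denoiser(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)  # per-position confidence and guess
        # Already-decoded positions stay fixed; only masked ones compete.
        conf = conf.masked_fill(tokens != MASK_ID, -float("inf"))
        # Cosine schedule: how many positions should remain masked after this step.
        keep_masked = int(SEQ_LEN * math.cos(math.pi / 2 * (step + 1) / NUM_STEPS))
        for b in range(batch_size):
            n_reveal = int((tokens[b] == MASK_ID).sum()) - keep_masked
            if n_reveal > 0:
                # Unmask the most confident predictions in parallel.
                idx = conf[b].topk(n_reveal).indices
                tokens[b, idx] = pred[b, idx]
    return tokens

print(sample().shape)  # torch.Size([1, 256]); each entry is a codebook index
```

The claimed efficiency gain over AR decoding comes from the outer loop running NUM_STEPS times (16 here) instead of once per token (256 here), with many tokens committed in parallel at each step.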