Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

October 7, 2025
作者: Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu
cs.AI

Abstract

We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing fully discrete diffusion modeling to handle inputs and outputs across various modalities. This approach allows Lumina-DiMOO to achieve higher sampling efficiency than previous autoregressive (AR) or hybrid AR-diffusion paradigms and to flexibly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advances in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.
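
To make the abstract's central idea concrete, here is a minimal sketch of the mask-based discrete diffusion sampling loop that this family of models builds on: generation starts from a fully masked token sequence and reveals the most confident predictions in parallel over a few denoising steps, rather than decoding one token at a time as AR models do. Everything below (the toy denoiser, vocabulary size, and cosine unmasking schedule) is an illustrative assumption in the spirit of MaskGIT-style samplers, not Lumina-DiMOO's actual implementation.

```python
# Illustrative sketch of mask-based discrete diffusion sampling.
# All names and values are hypothetical, not the project's real API.
import math
import torch

VOCAB_SIZE = 8192      # hypothetical codebook size for discretized image tokens
MASK_ID = VOCAB_SIZE   # reserved [MASK] token id, outside the codebook range
SEQ_LEN = 256          # e.g., a 16x16 grid of image tokens
NUM_STEPS = 16         # a handful of parallel steps vs. SEQ_LEN AR passes

def denoiser(tokens: torch.Tensor) -> torch.Tensor:
    """Stand-in for the trained network: returns logits over the codebook
    for every position. A real model would condition on the text prompt."""
    return torch.randn(tokens.shape[0], tokens.shape[1], VOCAB_SIZE)

@torch.no_grad()
def sample(batch_size: int = 1) -> torch.Tensor:
    # Start from a fully masked sequence.
    tokens = torch.full((batch_size, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(NUM_STEPS):
        probs = denoiser(tokens).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)  # per-position confidence and guess
        # Already-decoded positions stay fixed; only masked ones compete.
        conf = conf.masked_fill(tokens != MASK_ID, -float("inf"))
        # Cosine schedule: how many positions should remain masked after this step.
        keep_masked = int(SEQ_LEN * math.cos(math.pi / 2 * (step + 1) / NUM_STEPS))
        for b in range(batch_size):
            n_reveal = int((tokens[b] == MASK_ID).sum()) - keep_masked
            if n_reveal > 0:
                # Unmask the most confident predictions in parallel.
                idx = conf[b].topk(n_reveal).indices
                tokens[b, idx] = pred[b, idx]
    return tokens

print(sample().shape)  # torch.Size([1, 256]); each entry is a codebook index
```

The claimed efficiency gain over AR decoding comes from the outer loop running NUM_STEPS times (16 here) instead of once per token (256 here), with many tokens committed in parallel at each step.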