

Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

September 23, 2025
Authors: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
cs.AI

Abstract

We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning and stratified sampling for efficient and high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.
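Below is a minimal, illustrative sketch of two ideas the abstract names: a larger "understanding" branch whose features condition a lighter "generation" branch, and iterative confidence-based unmasking of discrete image tokens in a masked diffusion model. Every class name, layer size, and scheduling choice here is a hypothetical stand-in for exposition; it is not Lavida-O's actual Elastic-MoT architecture or sampler, and it omits the paper's token compression, universal text conditioning, and stratified sampling.

```python
# Hypothetical sketch only: toy sizes, toy modules, not the authors' implementation.
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN = 1024, 1024, 64   # toy codebook; MASK_ID is one extra token id


class ToyElasticMoT(nn.Module):
    """Toy stand-in: a wider encoder reads the text prompt, a narrower encoder
    denoises image tokens while attending over the projected text features."""

    def __init__(self, d_und=256, d_gen=128, heads=4):
        super().__init__()
        self.text_emb = nn.Embedding(VOCAB, d_und)
        self.und = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_und, heads, batch_first=True), num_layers=2)
        self.img_emb = nn.Embedding(VOCAB + 1, d_gen)      # +1 slot for [MASK]
        self.bridge = nn.Linear(d_und, d_gen)               # hand features to the small branch
        self.gen = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_gen, heads, batch_first=True), num_layers=2)
        self.head = nn.Linear(d_gen, VOCAB)

    def forward(self, text_ids, img_ids):
        ctx = self.bridge(self.und(self.text_emb(text_ids)))    # (B, T, d_gen)
        x = torch.cat([ctx, self.img_emb(img_ids)], dim=1)      # text tokens condition image tokens
        return self.head(self.gen(x))[:, ctx.size(1):]          # logits for image positions only


@torch.no_grad()
def sample(model, text_ids, steps=8):
    """MaskGIT-style schedule: predict every masked token, commit the most
    confident predictions, keep the rest masked, repeat for a fixed number of steps."""
    B = text_ids.size(0)
    img = torch.full((B, SEQ_LEN), MASK_ID, dtype=torch.long)
    revealed = 0
    for step in range(1, steps + 1):
        probs = model(text_ids, img).softmax(-1)
        conf, pred = probs.max(-1)                               # (B, SEQ_LEN)
        conf = conf.masked_fill(img != MASK_ID, -1.0)            # never revisit decided tokens
        target = int(SEQ_LEN * step / steps)                     # revealed fraction grows each step
        n_new = target - revealed
        if n_new <= 0:
            continue
        idx = conf.topk(n_new, dim=-1).indices                   # most confident masked slots
        img.scatter_(1, idx, pred.gather(1, idx))                # commit those predictions
        revealed = target
    return img


# Usage with random weights, just to show the shapes involved:
model = ToyElasticMoT()
prompt = torch.randint(0, VOCAB, (1, 16))      # pretend-tokenised text prompt
tokens = sample(model, prompt)                  # (1, SEQ_LEN) discrete image token ids
```

Because several masked positions are committed in parallel at each step, this style of sampler needs far fewer forward passes than token-by-token autoregressive decoding, which is consistent with the inference speedup the abstract claims, though the actual gains depend on the paper's own schedule and architecture.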