Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
September 23, 2025
作者: Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
cs.AI
Abstract
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal
understanding and generation. Unlike existing multimodal MDMs such as MMaDa and
Muddit which only support simple image-level understanding tasks and
low-resolution image generation, Lavida-O presents a single framework that
enables image-level understanding, object grounding, image editing, and
high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel
Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a
lightweight generation branch with a larger understanding branch, supported by
token compression, universal text conditioning and stratified sampling for
efficient and high-quality generation. Lavida-O further incorporates planning
and iterative self-reflection in image generation and editing tasks, seamlessly
boosting generation quality with its understanding capabilities. Lavida-O
achieves state-of-the-art performance on a wide range of benchmarks including
RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image
editing, outperforming existing autoregressive models and continuous diffusion
models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable
speedup at inference. These advances establish Lavida-O as a new paradigm for
scalable multimodal reasoning and generation.
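
The abstract describes an Elastic Mixture-of-Transformers that couples a lightweight generation branch with a larger understanding branch. Below is a minimal, hypothetical sketch of what such a coupling could look like in PyTorch; the class name ElasticMoTBlock, the dimensions, and the linear bridge are illustrative assumptions based only on the abstract, not the paper's actual implementation.

```python
# Hypothetical sketch (not the paper's code): a wide "understanding" branch
# and a narrow "generation" branch exchange information through a linear
# projection, so the small branch can condition on the large branch's states.
import torch
import torch.nn as nn

class ElasticMoTBlock(nn.Module):
    def __init__(self, und_dim=1024, gen_dim=256, n_heads=8):
        super().__init__()
        # Larger branch processes understanding tokens (e.g. text + image features).
        self.und_layer = nn.TransformerEncoderLayer(
            d_model=und_dim, nhead=n_heads, batch_first=True)
        # Lightweight branch denoises masked image tokens for generation.
        self.gen_layer = nn.TransformerEncoderLayer(
            d_model=gen_dim, nhead=n_heads, batch_first=True)
        # Bridge that projects understanding states into the generation width.
        self.bridge = nn.Linear(und_dim, gen_dim)

    def forward(self, und_tokens, gen_tokens):
        und_out = self.und_layer(und_tokens)
        # Condition the generation branch on projected understanding states
        # by prepending them along the sequence dimension.
        cond = self.bridge(und_out)
        gen_in = torch.cat([cond, gen_tokens], dim=1)
        gen_out = self.gen_layer(gen_in)[:, cond.size(1):]
        return und_out, gen_out

# Toy usage: batch of 2, 16 understanding tokens, 64 masked image tokens.
block = ElasticMoTBlock()
u = torch.randn(2, 16, 1024)
g = torch.randn(2, 64, 256)
u_out, g_out = block(u, g)
print(u_out.shape, g_out.shape)  # torch.Size([2, 16, 1024]) torch.Size([2, 64, 256])
```

The split widths reflect the abstract's claim that generation runs on a smaller branch than understanding; how Lavida-O actually fuses the branches (and how token compression, universal text conditioning, and stratified sampling are applied) is detailed in the paper itself.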