Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation

Abstract

We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit, which support only simple image-level understanding tasks and low-resolution image generation, Lavida-O offers a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning, and stratified sampling for efficient, high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, using its understanding capabilities to boost generation quality. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks, including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.

