MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
April 21, 2026
Authors: Zijie Li, Yichun Shi, Jingxiang Sun, Ye Wang, Yixuan Huang, Zhiyao Guo, Xiaochen Lian, Peihao Zhu, Yu Tian, Zhonghua Zhai, Peng Wang
cs.AI
Abstract
We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis.
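The core mechanism described above, learnable query tokens appended to a frozen VLM whose output hidden states are projected into the conditioning space of a diffusion model, can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation: the class name `QueryTokenConnector`, the dimensions, the query count, and the HuggingFace-style `vlm(inputs_embeds=...)` interface with a `last_hidden_state` output are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class QueryTokenConnector(nn.Module):
    """Sketch of a connector that predicts semantic visual embeddings
    from a frozen VLM via learnable query tokens.

    All hyperparameters below are hypothetical, not values from the paper.
    """

    def __init__(self, vlm_dim=4096, cond_dim=1024, num_queries=64):
        super().__init__()
        # Learnable query tokens appended to the VLM input sequence.
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        # Project VLM hidden states at the query positions into the
        # conditioning space expected by the diffusion model.
        self.proj = nn.Linear(vlm_dim, cond_dim)

    def forward(self, vlm, prompt_embeds):
        # prompt_embeds: (batch, seq, vlm_dim) embeddings of the multimodal
        # prompt (text and/or reference images), produced upstream.
        b = prompt_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Run the frozen VLM over [prompt ; queries]; only the connector
        # parameters (queries and projection) would receive gradients.
        # Assumes a HuggingFace-style interface accepting inputs_embeds.
        hidden = vlm(
            inputs_embeds=torch.cat([prompt_embeds, q], dim=1)
        ).last_hidden_state
        # Keep only the hidden states at the query positions: these are
        # the predicted semantic visual embeddings.
        visual_embeds = hidden[:, -q.size(1):, :]
        return self.proj(visual_embeds)  # (batch, num_queries, cond_dim)
```

In this reading, the returned `(batch, num_queries, cond_dim)` tensor would be fed to the diffusion model as its cross-attention context (e.g., in place of text-encoder outputs in a diffusers-style denoiser), which is what lets the design avoid deep fusion between the autoregressive and diffusion components: only the lightweight connector bridges the two pre-trained models.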
MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.