
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

April 21, 2026
作者: Zijie Li, Yichun Shi, Jingxiang Sun, Ye Wang, Yixuan Huang, Zhiyao Guo, Xiaochen Lian, Peihao Zhu, Yu Tian, Zhonghua Zhai, Peng Wang
cs.AI

Abstract

We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently serve as conditioning signals for a diffusion model. This streamlined design effectively transfers the rich understanding and reasoning capabilities of VLMs into the visual generation process. By obviating the need for deep fusion between autoregressive and diffusion models or training from scratch, MMCORE significantly reduces computational overhead while maintaining high-fidelity synthesis. MMCORE seamlessly integrates text-to-image synthesis with interleaved image generation, demonstrating robust multimodal comprehension in complex scenarios such as spatial reasoning and visual grounding. Comprehensive evaluations indicate that MMCORE consistently outperforms state-of-the-art baselines across a broad spectrum of text-to-image and single/multi-image editing benchmarks.
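The pipeline the abstract describes, in which learnable query tokens extract semantic visual embeddings from a frozen VLM and those embeddings condition a diffusion model, can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: the module name `QueryConditioner`, the dimensions, and the single cross-attention layer are all assumptions.

```python
import torch
import torch.nn as nn

class QueryConditioner(nn.Module):
    """Illustrative sketch of the abstract's design: learnable query tokens
    cross-attend over (frozen) VLM hidden states to produce semantic visual
    embeddings, then project them into a diffusion model's conditioning space.
    All names and dimensions here are hypothetical."""

    def __init__(self, num_queries=64, vlm_dim=1024, cond_dim=768):
        super().__init__()
        # Learnable query tokens, shared across inputs.
        self.queries = nn.Parameter(torch.randn(num_queries, vlm_dim) * 0.02)
        # Queries attend over the VLM's hidden states.
        self.attn = nn.MultiheadAttention(vlm_dim, num_heads=8, batch_first=True)
        # Map the attended embeddings into the diffusion conditioning space.
        self.proj = nn.Linear(vlm_dim, cond_dim)

    def forward(self, vlm_hidden):
        # vlm_hidden: (batch, seq_len, vlm_dim) from a frozen, pre-trained VLM.
        batch = vlm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        out, _ = self.attn(q, vlm_hidden, vlm_hidden)  # cross-attention
        return self.proj(out)  # (batch, num_queries, cond_dim)

conditioner = QueryConditioner()
hidden = torch.randn(2, 77, 1024)  # stand-in for VLM hidden states
cond = conditioner(hidden)
print(cond.shape)  # torch.Size([2, 64, 768])
```

Because only the query tokens, attention, and projection are trained, the VLM and the diffusion backbone can stay frozen, which is what lets this design avoid deep autoregressive/diffusion fusion or training from scratch.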
PDF · April 24, 2026