EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
June 13, 2024
Authors: Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang
cs.AI
Abstract
Recent advances in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance the conditions effectively, typically favoring one modality over the others. To address this challenge, we introduce EMMA, a novel image generation model that accepts multi-modal prompts, built upon the state-of-the-art text-to-image (T2I) diffusion model ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector, which integrates textual and supplementary modal information via a special attention mechanism. By freezing all parameters of the original T2I diffusion model and adjusting only a few additional layers, we uncover an interesting property: the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This property makes EMMA easy to adapt to different existing frameworks, yielding a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy for assembling learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training on mixed multi-modal prompts. Extensive experiments demonstrate EMMA's effectiveness in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.
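
The abstract describes the Multi-modal Feature Connector only at a high level. Below is a minimal, hypothetical PyTorch sketch of the general idea, assuming a cross-attention fusion in which text tokens attend to features from an extra modality before conditioning a frozen diffusion model; the class name `MultiModalConnector`, the dimensions, and the residual fusion scheme are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class MultiModalConnector(nn.Module):
    """Hypothetical sketch: fuse text tokens with features from an
    additional modality (e.g. a reference image) via cross-attention,
    producing conditioning tokens for a frozen T2I diffusion model."""

    def __init__(self, text_dim=768, modal_dim=1024, num_heads=8):
        super().__init__()
        # Project modality features into the text feature space.
        self.modal_proj = nn.Linear(modal_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_tokens, modal_tokens):
        # text_tokens:  (B, N_text, text_dim),   e.g. from the text encoder
        # modal_tokens: (B, N_modal, modal_dim), e.g. from a vision encoder
        kv = self.modal_proj(modal_tokens)
        fused, _ = self.cross_attn(query=text_tokens, key=kv, value=kv)
        # A residual connection keeps the text prompt dominant.
        return self.norm(text_tokens + fused)

# Per the abstract, the pre-trained diffusion model stays frozen and
# only the connector's extra layers are trained, e.g.:
#   for p in diffusion_unet.parameters():
#       p.requires_grad_(False)

connector = MultiModalConnector()
text = torch.randn(2, 77, 768)      # text-encoder tokens (shapes assumed)
image = torch.randn(2, 257, 1024)   # vision-encoder patch features (shapes assumed)
cond = connector(text, image)       # (2, 77, 768): conditioning for the frozen model
```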
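
The module-assembly strategy is likewise only sketched in the abstract. One plausible reading, shown below under that assumption, is that independently trained connectors (say, one per modality) are combined at inference time as a weighted sum of their output tokens, so no retraining on mixed multi-modal prompts is needed; `assemble_emma_modules` and the uniform weighting are hypothetical, reusing `MultiModalConnector` from the previous sketch.

```python
import torch

def assemble_emma_modules(text_tokens, modal_inputs, connectors, weights=None):
    """Hypothetical sketch: combine independently trained EMMA connectors
    (e.g. one trained on faces, one on reference styles) at inference
    time, without any training on mixed multi-modal prompts."""
    if weights is None:
        weights = [1.0 / len(connectors)] * len(connectors)
    fused = torch.zeros_like(text_tokens)
    for w, connector, modal_tokens in zip(weights, connectors, modal_inputs):
        fused = fused + w * connector(text_tokens, modal_tokens)
    return fused  # combined conditioning for the frozen diffusion model
```

Because each connector only remaps the conditioning tokens while the diffusion backbone never changes, their outputs live in the same feature space and can be mixed linearly; this is one simple way the training-free assembly described in the abstract could work.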