
EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

June 13, 2024
作者: Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang
cs.AI

Abstract
Recent advancements in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance multiple conditions effectively, typically showing a preference for one modality over others. To address this challenge, we introduce EMMA, a novel image generation model accepting multi-modal prompts built upon the state-of-the-art text-to-image (T2I) diffusion model, ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector design, which effectively integrates textual and supplementary modal information using a special attention mechanism. By freezing all parameters in the original T2I diffusion model and only adjusting some additional layers, we uncover an interesting property: the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This property facilitates easy adaptation to different existing frameworks, making EMMA a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy to assemble learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training with mixed multi-modal prompts. Extensive experiments demonstrate the effectiveness of EMMA in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.
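To make the connector idea concrete, here is a minimal NumPy sketch of one plausible design in the spirit the abstract describes: text tokens act as queries, the key/value set is the concatenation of text and extra-modality tokens, and a gate (initialized near zero) adds the fused signal residually so training starts close to the frozen text-only model. The function names, the concatenated key/value layout, and the scalar gate are illustrative assumptions, not EMMA's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # scaled dot-product attention: (Lq,d) x (Lkv,d) -> (Lq,d)
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v

def multimodal_connector(text_tokens, extra_tokens, Wq, Wk, Wv, gate):
    """Hypothetical multi-modal feature connector (illustrative only).

    text_tokens:  (L_text, d)  features from the text encoder
    extra_tokens: (L_extra, d) features from a reference modality
    gate:         scalar; near 0 at init so the frozen T2I model's
                  text-only behavior is preserved at the start
    """
    # keys/values see both modalities; queries come from text,
    # so textual guidance steers which reference features are injected
    kv = np.concatenate([text_tokens, extra_tokens], axis=0)
    q = text_tokens @ Wq
    k = kv @ Wk
    v = kv @ Wv
    fused = cross_attention(q, k, v)
    # residual gating: output feeds the frozen diffusion backbone
    # in place of plain text embeddings
    return text_tokens + gate * fused
```

With `gate = 0` the connector is an identity on the text tokens, which mirrors why only the small additional layers need training: the frozen backbone still receives valid "text-like" conditioning throughout.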
PDF · December 6, 2024