EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
June 13, 2024
Authors: Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang
cs.AI
Abstract
Recent advances in image generation have enabled the creation of high-quality images from text conditions. However, when facing multi-modal conditions, such as text combined with reference appearances, existing methods struggle to balance the conditions effectively, typically favoring one modality over the others. To address this challenge, we introduce EMMA, a novel image generation model that accepts multi-modal prompts, built upon the state-of-the-art text-to-image (T2I) diffusion model ELLA. EMMA seamlessly incorporates additional modalities alongside text to guide image generation through an innovative Multi-modal Feature Connector, which integrates textual and supplementary modal information via a special attention mechanism. By freezing all parameters of the original T2I diffusion model and adjusting only a few additional layers, we uncover an interesting property: the pre-trained T2I diffusion model can secretly accept multi-modal prompts. This property makes EMMA easy to adapt to different existing frameworks, yielding a flexible and effective tool for producing personalized and context-aware images and even videos. Additionally, we introduce a strategy for assembling learned EMMA modules to produce images conditioned on multiple modalities simultaneously, eliminating the need for additional training on mixed multi-modal prompts. Extensive experiments demonstrate EMMA's effectiveness in maintaining high fidelity and detail in generated images, showcasing its potential as a robust solution for advanced multi-modal conditional image generation tasks.
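
The abstract describes the Multi-modal Feature Connector only at a high level. Below is a minimal, hypothetical PyTorch sketch of the general idea, assuming a cross-attention fusion in which text tokens attend to features from an extra modality before conditioning a frozen diffusion model; the class name `MultiModalConnector`, the dimensions, and the residual fusion scheme are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class MultiModalConnector(nn.Module):
    """Hypothetical sketch: fuse text tokens with features from an
    additional modality (e.g. a reference image) via cross-attention,
    producing conditioning tokens for a frozen T2I diffusion model."""

    def __init__(self, text_dim=768, modal_dim=1024, num_heads=8):
        super().__init__()
        # Project modality features into the text feature space.
        self.modal_proj = nn.Linear(modal_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(text_dim)

    def forward(self, text_tokens, modal_tokens):
        # text_tokens:  (B, N_text, text_dim),   e.g. from the text encoder
        # modal_tokens: (B, N_modal, modal_dim), e.g. from a vision encoder
        kv = self.modal_proj(modal_tokens)
        fused, _ = self.cross_attn(query=text_tokens, key=kv, value=kv)
        # A residual connection keeps the text prompt dominant.
        return self.norm(text_tokens + fused)

# Per the abstract, the pre-trained diffusion model stays frozen and
# only the connector's extra layers are trained, e.g.:
#   for p in diffusion_unet.parameters():
#       p.requires_grad_(False)

connector = MultiModalConnector()
text = torch.randn(2, 77, 768)      # text-encoder tokens (shapes assumed)
image = torch.randn(2, 257, 1024)   # vision-encoder patch features (shapes assumed)
cond = connector(text, image)       # (2, 77, 768): conditioning for the frozen model
```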
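
The module-assembly strategy is likewise only sketched in the abstract. One plausible reading, shown below under that assumption, is that independently trained connectors (say, one per modality) are combined at inference time as a weighted sum of their output tokens, so no retraining on mixed multi-modal prompts is needed; `assemble_emma_modules` and the uniform weighting are hypothetical, reusing `MultiModalConnector` from the previous sketch.

```python
import torch

def assemble_emma_modules(text_tokens, modal_inputs, connectors, weights=None):
    """Hypothetical sketch: combine independently trained EMMA connectors
    (e.g. one trained on faces, one on reference styles) at inference
    time, without any training on mixed multi-modal prompts."""
    if weights is None:
        weights = [1.0 / len(connectors)] * len(connectors)
    fused = torch.zeros_like(text_tokens)
    for w, connector, modal_tokens in zip(weights, connectors, modal_inputs):
        fused = fused + w * connector(text_tokens, modal_tokens)
    return fused  # combined conditioning for the frozen diffusion model
```

Because each connector only remaps the conditioning tokens while the diffusion backbone never changes, their outputs live in the same feature space and can be mixed linearly; this is one simple way the training-free assembly described in the abstract could work.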