EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts
June 13, 2024
Authors: Yucheng Han, Rui Wang, Chi Zhang, Juntao Hu, Pei Cheng, Bin Fu, Hanwang Zhang
cs.AI
Abstract
Recent advancements in image generation have enabled the creation of
high-quality images from text conditions. However, when facing multi-modal
conditions, such as text combined with reference appearances, existing methods
struggle to balance multiple conditions effectively, typically showing a
preference for one modality over others. To address this challenge, we
introduce EMMA, a novel image generation model that accepts multi-modal
prompts, built upon the state-of-the-art text-to-image (T2I) diffusion
model ELLA. EMMA
seamlessly incorporates additional modalities alongside text to guide image
generation through an innovative Multi-modal Feature Connector design, which
effectively integrates textual and supplementary modal information using a
special attention mechanism. By freezing all parameters in the original T2I
diffusion model and adjusting only a small set of additional layers, we
reveal an interesting finding: the pre-trained T2I diffusion model can
secretly accept multi-modal prompts. This property facilitates easy
adaptation to different existing frameworks, making EMMA a flexible and
effective tool for producing personalized and context-aware images and even
videos. Additionally, we introduce a strategy to assemble learned EMMA modules
to produce images conditioned on multiple modalities simultaneously,
eliminating the need for additional training with mixed multi-modal prompts.
Extensive experiments demonstrate the effectiveness of EMMA in maintaining high
fidelity and detail in generated images, showcasing its potential as a robust
solution for advanced multi-modal conditional image generation tasks.
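
The abstract describes the Multi-modal Feature Connector only at a high
level: a trainable attention module that merges text features with a
supplementary modality while the underlying T2I diffusion model stays
frozen. The following is a minimal PyTorch sketch of that idea under
stated assumptions; the class name MultiModalConnector, the dimensions,
and the single cross-attention block are hypothetical illustrations,
not EMMA's actual architecture.

import torch
import torch.nn as nn

class MultiModalConnector(nn.Module):
    """Hypothetical trainable connector: text tokens attend to tokens
    from a supplementary modality (e.g., reference-appearance features)
    via cross-attention, producing fused conditioning features."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, text_feats, extra_feats):
        # Queries come from text; keys/values come from the extra
        # modality, so text remains the primary condition being enriched.
        fused, _ = self.cross_attn(text_feats, extra_feats, extra_feats)
        x = self.norm(text_feats + fused)
        return x + self.ffn(x)

# Mirror the abstract's training recipe: freeze the pre-trained T2I
# pathway and train only the connector's additional layers.
text_encoder = nn.Linear(768, 768)  # stand-in for a real frozen text encoder
for p in text_encoder.parameters():
    p.requires_grad_(False)

connector = MultiModalConnector()
text_feats = text_encoder(torch.randn(1, 77, 768))  # (batch, tokens, dim)
image_feats = torch.randn(1, 16, 768)               # reference-image tokens
cond = connector(text_feats, image_feats)           # conditioning for the frozen U-Net
print(cond.shape)  # torch.Size([1, 77, 768])

Using text features as the attention queries is one way to realize the
balance the abstract targets: the supplementary modality refines the
text condition rather than replacing it, consistent with the claim that
the frozen T2I model can accept the fused features as if they were an
ordinary text prompt.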