FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
May 17, 2023
Authors: Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
cs.AI
Abstract
Diffusion models excel at text-to-image generation, especially in
subject-driven generation for personalized images. However, existing methods
are inefficient due to subject-specific fine-tuning, which is
computationally intensive and hampers efficient deployment. Moreover, existing
methods struggle with multi-subject generation as they often blend features
among subjects. We present FastComposer which enables efficient, personalized,
multi-subject text-to-image generation without fine-tuning. FastComposer uses
subject embeddings extracted by an image encoder to augment the generic text
conditioning in diffusion models, enabling personalized image generation based
on subject images and textual instructions with only forward passes. To address
the identity blending problem in multi-subject generation, FastComposer
proposes cross-attention localization supervision during training, enforcing
the attention of reference subjects to be localized to the correct regions of
the target images. Naively conditioning on subject embeddings results in subject
overfitting. FastComposer proposes delayed subject conditioning in the
denoising step to maintain both identity and editability in subject-driven
image generation. FastComposer generates images of multiple unseen individuals
with different styles, actions, and contexts. It achieves a
300×–2500× speedup compared to fine-tuning-based methods and
requires zero extra storage for new subjects. FastComposer paves the way for
efficient, personalized, and high-quality multi-subject image creation. Code,
model, and dataset are available at
https://github.com/mit-han-lab/fastcomposer.
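
As a rough illustration of the delayed subject conditioning described above, the sketch below (hypothetical names, not the authors' code) runs the first `alpha` fraction of denoising steps with the plain text embedding, then switches to the subject-augmented embedding for the remaining steps:

```python
def delayed_subject_conditioning(denoise_step, latents, timesteps,
                                 text_emb, subject_emb, alpha=0.2):
    """Sketch of delayed subject conditioning (hypothetical API).

    Early steps condition only on the text embedding, which lays out
    the scene and preserves editability; later steps condition on the
    subject-augmented embedding, which injects the reference identity.
    `denoise_step(latents, t, cond)` stands in for one U-Net +
    scheduler update in a real diffusion pipeline.
    """
    switch_at = int(alpha * len(timesteps))  # step index where conditioning switches
    for i, t in enumerate(timesteps):
        cond = text_emb if i < switch_at else subject_emb
        latents = denoise_step(latents, t, cond)
    return latents
```

With `alpha=0.3` over 10 steps, the first 3 steps use the text embedding and the last 7 use the subject embedding; `alpha` trades off editability against identity preservation.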