FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
May 17, 2023
Authors: Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
cs.AI
Abstract
Diffusion models excel at text-to-image generation, especially in
subject-driven generation for personalized images. However, existing methods
are inefficient due to subject-specific fine-tuning, which is
computationally intensive and hampers efficient deployment. Moreover, existing
methods struggle with multi-subject generation as they often blend features
among subjects. We present FastComposer, which enables efficient, personalized,
multi-subject text-to-image generation without fine-tuning. FastComposer uses
subject embeddings extracted by an image encoder to augment the generic text
conditioning in diffusion models, enabling personalized image generation based
on subject images and textual instructions with only forward passes. To address
the identity blending problem in multi-subject generation, FastComposer
proposes cross-attention localization supervision during training, enforcing
the attention of reference subjects to be localized to the correct regions in the
target images. Naively conditioning on subject embeddings results in subject
overfitting. FastComposer proposes delayed subject conditioning in the
denoising step to maintain both identity and editability in subject-driven
image generation. FastComposer generates images of multiple unseen individuals
with different styles, actions, and contexts. It achieves a 300×-2500×
speedup compared to fine-tuning-based methods and
requires zero extra storage for new subjects. FastComposer paves the way for
efficient, personalized, and high-quality multi-subject image creation. Code,
model, and dataset are available at
https://github.com/mit-han-lab/fastcomposer.
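
The cross-attention localization supervision described in the abstract can be pictured as a loss that concentrates each reference subject token's attention inside that subject's segmentation mask. Below is a minimal PyTorch sketch of one such balanced objective; the function name, tensor layout, and exact loss form are illustrative assumptions, not the authors' implementation.

```python
import torch

def cross_attention_localization_loss(attn, masks, token_idx):
    """Balanced localization objective between attention maps and masks.

    attn:      (B, Q, T) cross-attention probabilities averaged over heads,
               with Q = h*w image queries and T = text tokens.
    masks:     (B, S, h, w) binary segmentation masks, one per subject.
    token_idx: (B, S) prompt position of each subject's reference token.
    """
    masks = masks.flatten(2).float()                            # (B, S, Q)
    # Pick out each subject token's attention over all image queries.
    idx = token_idx[..., None].expand(-1, -1, attn.shape[1])    # (B, S, Q)
    subj_attn = torch.gather(attn.transpose(1, 2), 1, idx)      # (B, S, Q)
    inside = (subj_attn * masks).sum(-1) / masks.sum(-1).clamp(min=1)
    outside = (subj_attn * (1 - masks)).sum(-1) / (1 - masks).sum(-1).clamp(min=1)
    # Reward attention inside the subject's region, penalize it outside.
    return (outside - inside).mean()
```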
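
Delayed subject conditioning can likewise be sketched as a two-phase sampling loop: early denoising steps see only the plain text embeddings (keeping the layout editable), while later steps see the subject-augmented embeddings (locking in identity). This is a minimal sketch using diffusers-style UNet and scheduler interfaces; `alpha`, the fraction of steps that receive subject conditioning, is an assumed hyperparameter, not a value from the paper.

```python
import torch

@torch.no_grad()
def sample_with_delayed_conditioning(unet, scheduler, latents,
                                     text_emb, augmented_emb, alpha=0.4):
    """Denoise with text-only conditioning first, subject conditioning later."""
    timesteps = scheduler.timesteps  # assumes scheduler.set_timesteps() was called
    switch = int(len(timesteps) * (1 - alpha))
    for i, t in enumerate(timesteps):
        cond = text_emb if i < switch else augmented_emb
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

In this sketch, a larger `alpha` would favor identity preservation over prompt editability; the schedule actually used by FastComposer may differ.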