FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention
May 17, 2023
Authors: Guangxuan Xiao, Tianwei Yin, William T. Freeman, Frédo Durand, Song Han
cs.AI
Abstract
Diffusion models excel at text-to-image generation, especially in
subject-driven generation for personalized images. However, existing methods
are inefficient due to subject-specific fine-tuning, which is
computationally intensive and hampers efficient deployment. Moreover, existing
methods struggle with multi-subject generation as they often blend features
among subjects. We present FastComposer, which enables efficient, personalized,
multi-subject text-to-image generation without fine-tuning. FastComposer uses
subject embeddings extracted by an image encoder to augment the generic text
conditioning in diffusion models, enabling personalized image generation based
on subject images and textual instructions with only forward passes. To address
the identity blending problem in multi-subject generation, FastComposer
proposes cross-attention localization supervision during training, enforcing
the attention of reference subjects to be localized to the correct regions in the
target images. Naively conditioning on subject embeddings results in subject
overfitting. FastComposer proposes delayed subject conditioning in the
denoising step to maintain both identity and editability in subject-driven
image generation. FastComposer generates images of multiple unseen individuals
with different styles, actions, and contexts. It achieves a 300×-2500×
speedup compared to fine-tuning-based methods and
requires zero extra storage for new subjects. FastComposer paves the way for
efficient, personalized, and high-quality multi-subject image creation. Code,
model, and dataset are available at
https://github.com/mit-han-lab/fastcomposer.
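
The cross-attention localization supervision described in the abstract can be pictured as a loss that concentrates each reference subject token's attention inside that subject's segmentation mask. Below is a minimal PyTorch sketch of one such balanced objective; the function name, tensor layout, and exact loss form are illustrative assumptions, not the authors' implementation.

```python
import torch

def cross_attention_localization_loss(attn, masks, token_idx):
    """Balanced localization objective between attention maps and masks.

    attn:      (B, Q, T) cross-attention probabilities averaged over heads,
               with Q = h*w image queries and T = text tokens.
    masks:     (B, S, h, w) binary segmentation masks, one per subject.
    token_idx: (B, S) prompt position of each subject's reference token.
    """
    masks = masks.flatten(2).float()                            # (B, S, Q)
    # Pick out each subject token's attention over all image queries.
    idx = token_idx[..., None].expand(-1, -1, attn.shape[1])    # (B, S, Q)
    subj_attn = torch.gather(attn.transpose(1, 2), 1, idx)      # (B, S, Q)
    inside = (subj_attn * masks).sum(-1) / masks.sum(-1).clamp(min=1)
    outside = (subj_attn * (1 - masks)).sum(-1) / (1 - masks).sum(-1).clamp(min=1)
    # Reward attention inside the subject's region, penalize it outside.
    return (outside - inside).mean()
```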
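
Delayed subject conditioning can likewise be sketched as a two-phase sampling loop: early denoising steps see only the plain text embeddings (keeping the layout editable), while later steps see the subject-augmented embeddings (locking in identity). This is a minimal sketch using diffusers-style UNet and scheduler interfaces; `alpha`, the fraction of steps that receive subject conditioning, is an assumed hyperparameter, not a value from the paper.

```python
import torch

@torch.no_grad()
def sample_with_delayed_conditioning(unet, scheduler, latents,
                                     text_emb, augmented_emb, alpha=0.4):
    """Denoise with text-only conditioning first, subject conditioning later."""
    timesteps = scheduler.timesteps  # assumes scheduler.set_timesteps() was called
    switch = int(len(timesteps) * (1 - alpha))
    for i, t in enumerate(timesteps):
        cond = text_emb if i < switch else augmented_emb
        noise_pred = unet(latents, t, encoder_hidden_states=cond).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents
```

In this sketch, a larger `alpha` would favor identity preservation over prompt editability; the schedule actually used by FastComposer may differ.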