FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

Abstract

Diffusion models excel at text-to-image generation, especially in subject-driven generation for personalized images. However, existing methods are inefficient because they require subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Moreover, existing methods struggle with multi-subject generation, as they often blend features among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation based on subject images and textual instructions with only forward passes. To address the identity blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, enforcing that the attention of each reference subject is localized to the correct region in the target image. Naively conditioning on subject embeddings results in subject overfitting. FastComposer therefore proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300× to 2500× speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
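The abstract names two mechanisms, delayed subject conditioning and cross-attention localization supervision, without giving their exact form. The following is a minimal sketch of how they might look, under stated assumptions: the schedule uses text-only conditioning for an initial fraction of the denoising steps and then switches to subject-augmented conditioning, and the loss is the mean attention outside a subject's mask minus the mean attention inside it. All names here (`conditioning_schedule`, `text_only_fraction`, `localization_loss`) are hypothetical and are not taken from the FastComposer code base.

```python
from typing import List, Sequence


def use_subject_conditioning(step: int, num_steps: int, text_only_fraction: float) -> bool:
    """Return True once the assumed text-only phase of the schedule is over."""
    if not 0.0 <= text_only_fraction <= 1.0:
        raise ValueError("text_only_fraction must be in [0, 1]")
    return step >= int(num_steps * text_only_fraction)


def conditioning_schedule(num_steps: int, text_only_fraction: float) -> List[str]:
    """Label each denoising step with the conditioning it would use.

    Assumption: early steps use text-only conditioning (to keep the prompt
    editable) and later steps use the subject-augmented conditioning (to
    keep identity).
    """
    return [
        "subject" if use_subject_conditioning(s, num_steps, text_only_fraction) else "text"
        for s in range(num_steps)
    ]


def localization_loss(attention: Sequence[float], mask: Sequence[int]) -> float:
    """Assumed loss: mean attention outside the subject mask minus mean inside.

    `attention` is a flattened cross-attention map for one subject token and
    `mask` is the matching flattened binary segmentation mask. Lower values
    mean the attention is better localized to the subject's region.
    """
    if len(attention) != len(mask):
        raise ValueError("attention and mask must have the same length")
    inside = [a for a, m in zip(attention, mask) if m]
    outside = [a for a, m in zip(attention, mask) if not m]
    if not inside or not outside:
        raise ValueError("mask must contain both subject and background positions")
    return sum(outside) / len(outside) - sum(inside) / len(inside)


if __name__ == "__main__":
    # Four steps, first half text-only: ['text', 'text', 'subject', 'subject']
    print(conditioning_schedule(4, 0.5))
    # Attention concentrated inside the mask gives a negative (better) loss.
    print(localization_loss([0.8, 0.6, 0.1, 0.1], [1, 1, 0, 0]))
```

In a real diffusion pipeline the schedule would select which conditioning tensor is passed to the denoiser at each step, and the loss would be averaged over subjects and attention layers; those details are left out here because the abstract does not specify them.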
