FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

Abstract

Diffusion models excel at text-to-image generation, especially subject-driven generation of personalized images. However, existing methods are inefficient because they require subject-specific fine-tuning, which is computationally intensive and hampers efficient deployment. Existing methods also struggle with multi-subject generation, as they often blend features among subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation from subject images and textual instructions with only forward passes. To address the identity blending problem in multi-subject generation, FastComposer proposes cross-attention localization supervision during training, which forces the attention of each reference subject to localize to the correct regions of the target images. Naively conditioning on subject embeddings results in subject overfitting. FastComposer therefore proposes delayed subject conditioning in the denoising step to maintain both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300×-2500× speedup compared to fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
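The abstract names two mechanisms without giving formulas: delayed subject conditioning and cross-attention localization supervision. The sketch below is one plausible reading of each, not the authors' implementation. The function names, the `alpha` ratio, the convention that step 0 is the first (noisiest) denoising step, and the "mean attention outside the mask minus mean attention inside" form of the loss are all assumptions made for illustration.

```python
from typing import Sequence, TypeVar

C = TypeVar("C")


def select_conditioning(step_index: int, num_steps: int, alpha: float,
                        text_only: C, subject_augmented: C) -> C:
    """Delayed subject conditioning (illustrative sketch).

    Assumption: step_index 0 is the first, noisiest denoising step.
    The first (1 - alpha) fraction of steps uses text-only conditioning,
    and the remaining steps use the subject-augmented conditioning.
    `alpha` is a hypothetical hyperparameter name.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    switch_step = int(round((1.0 - alpha) * num_steps))
    return text_only if step_index < switch_step else subject_augmented


def localization_loss(attention: Sequence[float], mask: Sequence[int]) -> float:
    """Cross-attention localization penalty for one subject token (sketch).

    `attention` holds the subject token's attention weight at each image
    position, and `mask` marks (1/0) the positions belonging to that subject.
    Assumed form: mean attention outside the mask minus mean attention
    inside it, so lower values mean better-localized attention.
    """
    if len(attention) != len(mask):
        raise ValueError("attention and mask must have the same length")
    inside = [a for a, m in zip(attention, mask) if m]
    outside = [a for a, m in zip(attention, mask) if not m]
    if not inside or not outside:
        raise ValueError("mask must contain both subject and background positions")
    return sum(outside) / len(outside) - sum(inside) / len(inside)
```

Under these assumptions, a larger `alpha` hands control to the subject embeddings earlier, which favors identity preservation, while a smaller `alpha` lets the text prompt shape more of the layout, which favors editability. In a multi-subject setting the localization penalty would be computed per subject token and averaged.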

