Training-Free Consistent Text-to-Image Generation

Abstract

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
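
To make the idea of sharing internal activations more concrete, the following is a minimal sketch of one plausible reading of a subject-driven shared attention step. It is not the authors' implementation. It assumes that each image in a batch attends to all of its own patches plus only those patches of the other images that a boolean subject mask marks as belonging to the subject. The function name `shared_attention`, the nested-list tensor layout, and the mask format are illustrative assumptions. Correspondence-based feature injection and the layout-diversity strategies are not modeled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def shared_attention(queries, keys, values, subject_masks):
    """Hypothetical shared self-attention across a batch of images.

    queries, keys, values: nested lists indexed as [image][patch][dim].
    subject_masks: nested lists indexed as [image][patch]; True marks a
        patch assumed to belong to the subject (assumed format).
    Returns attention outputs indexed as [image][patch][dim].
    """
    n_images = len(queries)
    scale = 1.0 / math.sqrt(len(queries[0][0]))
    outputs = []
    for i in range(n_images):
        # Keys/values visible to image i: every patch of image i itself,
        # plus only the subject patches of every other image.
        visible_keys, visible_values = [], []
        for j in range(n_images):
            for p, (k, v) in enumerate(zip(keys[j], values[j])):
                if j == i or subject_masks[j][p]:
                    visible_keys.append(k)
                    visible_values.append(v)
        value_dim = len(visible_values[0])
        image_out = []
        for q in queries[i]:
            weights = softmax([scale * dot(q, k) for k in visible_keys])
            # Weighted average of the visible value vectors.
            image_out.append([
                sum(w * v[d] for w, v in zip(weights, visible_values))
                for d in range(value_dim)
            ])
        outputs.append(image_out)
    return outputs
```

With all masks set to False, this reduces to ordinary per-image self-attention. Marking a patch as subject in one image lets every other image in the batch read that patch's features, which is the sense in which activations are shared without any optimization step.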
