Training-Free Consistent Text-to-Image Generation
February 5, 2024
Authors: Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon
cs.AI
Abstract
Text-to-image models offer a new level of creative flexibility by allowing
users to guide the image generation process through natural language. However,
using these models to consistently portray the same subject across diverse
prompts remains challenging. Existing approaches fine-tune the model to teach
it new words that describe specific user-provided subjects or add image
conditioning to the model. These methods require lengthy per-subject
optimization or large-scale pre-training. Moreover, they struggle to align
generated images with text prompts and face difficulties in portraying multiple
subjects. Here, we present ConsiStory, a training-free approach that enables
consistent subject generation by sharing the internal activations of the
pretrained model. We introduce a subject-driven shared attention block and
correspondence-based feature injection to promote subject consistency between
images. Additionally, we develop strategies to encourage layout diversity while
maintaining subject consistency. We compare ConsiStory to a range of baselines,
and demonstrate state-of-the-art performance on subject consistency and text
alignment, without requiring a single optimization step. Finally, ConsiStory
can naturally extend to multi-subject scenarios, and even enable training-free
personalization for common objects.
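The core idea of sharing internal activations can be illustrated with a minimal sketch of an extended self-attention step. This is an assumption-laden illustration, not the authors' implementation: the function name `shared_self_attention` and the precomputed `subject_masks` argument are hypothetical (in the paper, subject masks are derived from cross-attention maps during sampling). Each image's queries attend to its own keys/values plus the subject-patch keys/values of every other image in the batch, which is what pushes the generated subjects toward consistency.

```python
import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v, subject_masks):
    """Sketch of a subject-driven shared attention step.

    q, k, v: (batch, tokens, dim) projections from one self-attention layer,
             one batch entry per generated image.
    subject_masks: (batch, tokens) boolean masks marking subject patches
             (assumed given here; the paper derives them from attention maps).

    Each image attends to its own tokens plus the *subject* tokens of the
    other images in the batch, sharing activations across generations.
    """
    b, t, d = q.shape
    outputs = []
    for i in range(b):
        # Start from image i's own keys and values.
        extra_k, extra_v = [k[i]], [v[i]]
        for j in range(b):
            if j == i:
                continue
            m = subject_masks[j]
            # Append only the subject patches of the other images.
            extra_k.append(k[j][m])
            extra_v.append(v[j][m])
        k_ext = torch.cat(extra_k, dim=0)   # (t + n_shared, d)
        v_ext = torch.cat(extra_v, dim=0)
        attn = F.softmax(q[i] @ k_ext.T / d ** 0.5, dim=-1)
        outputs.append(attn @ v_ext)        # (t, d)
    return torch.stack(outputs)             # (batch, tokens, dim)
```

In practice such a block would replace the standard self-attention in selected layers of a pretrained diffusion U-Net, requiring no fine-tuning, which matches the training-free claim of the abstract.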