Training-Free Consistent Text-to-Image Generation
February 5, 2024
Authors: Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon
cs.AI
Abstract
Text-to-image models offer a new level of creative flexibility by allowing
users to guide the image generation process through natural language. However,
using these models to consistently portray the same subject across diverse
prompts remains challenging. Existing approaches fine-tune the model to teach
it new words that describe specific user-provided subjects or add image
conditioning to the model. These methods require lengthy per-subject
optimization or large-scale pre-training. Moreover, they struggle to align
generated images with text prompts and face difficulties in portraying multiple
subjects. Here, we present ConsiStory, a training-free approach that enables
consistent subject generation by sharing the internal activations of the
pretrained model. We introduce a subject-driven shared attention block and
correspondence-based feature injection to promote subject consistency between
images. Additionally, we develop strategies to encourage layout diversity while
maintaining subject consistency. We compare ConsiStory to a range of baselines,
and demonstrate state-of-the-art performance on subject consistency and text
alignment, without requiring a single optimization step. Finally, ConsiStory
can naturally extend to multi-subject scenarios, and even enable training-free
personalization for common objects.
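The core idea of sharing internal activations can be illustrated with a minimal sketch of an extended self-attention step. This is an assumption-laden illustration, not the authors' implementation: the function name `shared_self_attention` and the precomputed `subject_masks` argument are hypothetical (in the paper, subject masks are derived from cross-attention maps during sampling). Each image's queries attend to its own keys/values plus the subject-patch keys/values of every other image in the batch, which is what pushes the generated subjects toward consistency.

```python
import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v, subject_masks):
    """Sketch of a subject-driven shared attention step.

    q, k, v: (batch, tokens, dim) projections from one self-attention layer,
             one batch entry per generated image.
    subject_masks: (batch, tokens) boolean masks marking subject patches
             (assumed given here; the paper derives them from attention maps).

    Each image attends to its own tokens plus the *subject* tokens of the
    other images in the batch, sharing activations across generations.
    """
    b, t, d = q.shape
    outputs = []
    for i in range(b):
        # Start from image i's own keys and values.
        extra_k, extra_v = [k[i]], [v[i]]
        for j in range(b):
            if j == i:
                continue
            m = subject_masks[j]
            # Append only the subject patches of the other images.
            extra_k.append(k[j][m])
            extra_v.append(v[j][m])
        k_ext = torch.cat(extra_k, dim=0)   # (t + n_shared, d)
        v_ext = torch.cat(extra_v, dim=0)
        attn = F.softmax(q[i] @ k_ext.T / d ** 0.5, dim=-1)
        outputs.append(attn @ v_ext)        # (t, d)
    return torch.stack(outputs)             # (batch, tokens, dim)
```

In practice such a block would replace the standard self-attention in selected layers of a pretrained diffusion U-Net, requiring no fine-tuning, which matches the training-free claim of the abstract.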