Training-Free Consistent Text-to-Image Generation
February 5, 2024
Authors: Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon
cs.AI
Abstract
Text-to-image models offer a new level of creative flexibility by allowing
users to guide the image generation process through natural language. However,
using these models to consistently portray the same subject across diverse
prompts remains challenging. Existing approaches fine-tune the model to teach
it new words that describe specific user-provided subjects or add image
conditioning to the model. These methods require lengthy per-subject
optimization or large-scale pre-training. Moreover, they struggle to align
generated images with text prompts and face difficulties in portraying multiple
subjects. Here, we present ConsiStory, a training-free approach that enables
consistent subject generation by sharing the internal activations of the
pretrained model. We introduce a subject-driven shared attention block and
correspondence-based feature injection to promote subject consistency between
images. Additionally, we develop strategies to encourage layout diversity while
maintaining subject consistency. We compare ConsiStory to a range of baselines,
and demonstrate state-of-the-art performance on subject consistency and text
alignment, without requiring a single optimization step. Finally, ConsiStory
can naturally extend to multi-subject scenarios, and even enable training-free
personalization for common objects.
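
To make the shared-attention idea concrete, below is a minimal PyTorch sketch of subject-driven shared attention: during denoising, each image in the batch extends its self-attention keys and values with tokens drawn from the subject regions of the other images, which pushes the generated subjects toward a consistent appearance. This is an illustrative assumption, not the authors' implementation; the function name, tensor shapes, and the origin of `subject_mask` are all hypothetical.

```python
# Minimal sketch (assumed interface, not the paper's code) of a shared
# self-attention step across a batch of images being generated together.
import torch
import torch.nn.functional as F

def shared_subject_attention(q, k, v, subject_mask):
    """
    q, k, v:       (B, N, D) per-image self-attention tensors at one layer.
    subject_mask:  (B, N) boolean mask marking subject patches in each image
                   (e.g., estimated from cross-attention maps -- an assumption).
    Returns:       (B, N, D) outputs where each image also attends to the
                   subject tokens of every other image in the batch.
    """
    B, N, D = q.shape
    outputs = []
    for i in range(B):
        # Keys/values from image i itself...
        ks, vs = [k[i]], [v[i]]
        # ...extended with subject-region tokens from the other images.
        # (In practice, randomly dropping some shared tokens is one way to
        # preserve layout diversity, per the abstract's stated strategies.)
        for j in range(B):
            if j == i:
                continue
            ks.append(k[j][subject_mask[j]])
            vs.append(v[j][subject_mask[j]])
        k_ext = torch.cat(ks, dim=0)  # (N + n_shared, D)
        v_ext = torch.cat(vs, dim=0)
        attn = F.softmax(q[i] @ k_ext.T / (D ** 0.5), dim=-1)
        outputs.append(attn @ v_ext)  # (N, D)
    return torch.stack(outputs)
```

In a real pipeline, a call like this would replace the self-attention computation inside selected layers of the diffusion U-Net, with `subject_mask` derived at each step from the cross-attention response to the subject's text token; both choices are assumptions here, not details taken from the abstract.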