Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models
July 13, 2023
Authors: Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano
cs.AI
Abstract
Text-to-image (T2I) personalization allows users to guide the creative image
generation process by combining their own visual concepts in natural language
prompts. Recently, encoder-based techniques have emerged as a new effective
approach for T2I personalization, reducing the need for multiple images and
long training times. However, most existing encoders are limited to a
single-class domain, which hinders their ability to handle diverse concepts. In
this work, we propose a domain-agnostic method that does not require any
specialized dataset or prior information about the personalized concepts. We
introduce a novel contrastive-based regularization technique to maintain high
fidelity to the target concept characteristics while keeping the predicted
embeddings close to editable regions of the latent space, by pushing the
predicted tokens toward their nearest existing CLIP tokens. Our experimental
results demonstrate the effectiveness of our approach and show that the learned
tokens are more semantically meaningful than those predicted by unregularized models. This
leads to a better representation that achieves state-of-the-art performance
while being more flexible than previous methods.
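To make the regularization idea concrete, below is a minimal PyTorch sketch of a contrastive loss that pushes each predicted concept embedding toward its nearest neighbor in the frozen CLIP token-embedding table. The function name, the InfoNCE-style formulation, the `temperature` value, and the usage weights are illustrative assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def nearest_token_regularizer(pred_embeds: torch.Tensor,
                              clip_vocab_embeds: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Pull each predicted embedding toward its nearest existing CLIP token.

    pred_embeds:       (B, D) concept embeddings predicted by the encoder.
    clip_vocab_embeds: (V, D) frozen token-embedding table of the CLIP
                       text encoder.
    """
    # Cosine similarities between predictions and the whole vocabulary.
    pred = F.normalize(pred_embeds, dim=-1)
    vocab = F.normalize(clip_vocab_embeds, dim=-1)
    logits = pred @ vocab.t() / temperature        # (B, V)

    # The current nearest vocabulary token serves as the positive;
    # every other token acts as a negative (InfoNCE-style cross entropy).
    targets = logits.argmax(dim=-1)                # (B,), non-differentiable
    return F.cross_entropy(logits, targets)

# Hypothetical usage: add the regularizer to the personalization objective.
# pred = encoder(images)                                        # (B, D)
# loss = diffusion_loss + 0.01 * nearest_token_regularizer(pred, vocab)
```

Pulling predictions toward real vocabulary embeddings keeps them in well-trodden, editable regions of the text-encoder latent space, which is the trade-off the abstract describes between concept fidelity and prompt editability.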