Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models
July 13, 2023
Authors: Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano
cs.AI
Abstract
Text-to-image (T2I) personalization allows users to guide the creative image
generation process by combining their own visual concepts in natural language
prompts. Recently, encoder-based techniques have emerged as a new effective
approach for T2I personalization, reducing the need for multiple images and
long training times. However, most existing encoders are limited to a
single-class domain, which hinders their ability to handle diverse concepts. In
this work, we propose a domain-agnostic method that does not require any
specialized dataset or prior information about the personalized concepts. We
introduce a novel contrastive-based regularization technique to maintain high
fidelity to the target concept characteristics while keeping the predicted
embeddings close to editable regions of the latent space, by pushing the
predicted tokens toward their nearest existing CLIP tokens. Our experimental
results demonstrate the effectiveness of our approach and show that the learned
tokens are more semantically meaningful than those predicted by unregularized models. This
leads to a better representation that achieves state-of-the-art performance
while being more flexible than previous methods.
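To make the regularization idea concrete, below is a minimal PyTorch sketch of a contrastive loss that pushes each predicted concept embedding toward its nearest neighbor in the frozen CLIP token-embedding table. The function name, the InfoNCE-style formulation, the `temperature` value, and the usage weights are illustrative assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def nearest_token_regularizer(pred_embeds: torch.Tensor,
                              clip_vocab_embeds: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """Pull each predicted embedding toward its nearest existing CLIP token.

    pred_embeds:       (B, D) concept embeddings predicted by the encoder.
    clip_vocab_embeds: (V, D) frozen token-embedding table of the CLIP
                       text encoder.
    """
    # Cosine similarities between predictions and the whole vocabulary.
    pred = F.normalize(pred_embeds, dim=-1)
    vocab = F.normalize(clip_vocab_embeds, dim=-1)
    logits = pred @ vocab.t() / temperature        # (B, V)

    # The current nearest vocabulary token serves as the positive;
    # every other token acts as a negative (InfoNCE-style cross entropy).
    targets = logits.argmax(dim=-1)                # (B,), non-differentiable
    return F.cross_entropy(logits, targets)

# Hypothetical usage: add the regularizer to the personalization objective.
# pred = encoder(images)                                        # (B, D)
# loss = diffusion_loss + 0.01 * nearest_token_regularizer(pred, vocab)
```

Pulling predictions toward real vocabulary embeddings keeps them in well-trodden, editable regions of the text-encoder latent space, which is the trade-off the abstract describes between concept fidelity and prompt editability.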