

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

July 13, 2023
Authors: Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, Amit H. Bermano
cs.AI

Abstract

Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique to maintain high fidelity to the target concept characteristics while keeping the predicted embeddings close to editable regions of the latent space, by pushing the predicted tokens toward their nearest existing CLIP tokens. Our experimental results demonstrate the effectiveness of our approach and show how the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
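
The abstract only names the nearest-token regularizer without giving its form. As a minimal sketch of the idea, assuming the encoder's predicted token embeddings and the frozen CLIP token-embedding table are available as tensors (the function and variable names below are hypothetical, not the authors' released code), an InfoNCE-style loss that pulls each prediction toward its nearest existing CLIP token might look like this:

```python
# Hypothetical sketch of a nearest-token contrastive regularizer,
# not the paper's official implementation.
import torch
import torch.nn.functional as F

def nearest_token_contrastive_loss(pred_tokens: torch.Tensor,
                                   vocab: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """Push each predicted embedding toward its nearest CLIP vocabulary token.

    pred_tokens: (B, d) token embeddings predicted by the tuning-encoder.
    vocab:       (N, d) frozen CLIP token-embedding table.
    """
    pred = F.normalize(pred_tokens, dim=-1)   # unit-norm predictions, (B, d)
    vocab = F.normalize(vocab, dim=-1)        # unit-norm vocabulary, (N, d)
    sim = pred @ vocab.t() / temperature      # cosine-similarity logits, (B, N)

    # Treat the nearest existing token as the positive; all other
    # vocabulary tokens act as negatives in an InfoNCE-style objective.
    targets = sim.argmax(dim=-1)              # (B,) index of nearest token
    return F.cross_entropy(sim, targets)
```

In training, such a term would typically be added to the usual diffusion denoising loss with a weighting hyperparameter; the temperature controls how strongly the prediction is attracted to its nearest token relative to the rest of the vocabulary.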