Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Abstract

Text-to-image (T2I) personalization lets users guide the creative image generation process by combining their own visual concepts with natural language prompts. Encoder-based techniques have recently emerged as an effective approach to T2I personalization, reducing the need for multiple images and long training times. However, most existing encoders are limited to a single-class domain, which hinders their ability to handle diverse concepts. In this work, we propose a domain-agnostic method that does not require any specialized dataset or prior information about the personalized concepts. We introduce a novel contrastive-based regularization technique that pushes the predicted tokens toward their nearest existing CLIP tokens, maintaining high fidelity to the target concept's characteristics while keeping the predicted embeddings close to editable regions of the latent space. Our experimental results demonstrate the effectiveness of our approach and show that the learned tokens are more semantic than tokens predicted by unregularized models. This leads to a better representation that achieves state-of-the-art performance while being more flexible than previous methods.
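The abstract does not spell out the form of the regularizer, so the following is only a minimal sketch of one plausible reading: an InfoNCE-style loss in which the predicted token's k nearest vocabulary embeddings act as positives and the remaining embeddings act as negatives. The function names, the cosine similarity measure, the temperature value, and the toy three-token vocabulary are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_token_loss(pred, vocab, k=1, temperature=0.1):
    """Hypothetical contrastive regularizer (an assumption, not the paper's loss).

    Treats the k vocabulary embeddings most similar to `pred` as positives
    and all other vocabulary embeddings as negatives, then returns the mean
    negative log-softmax score of the positives.
    """
    logits = [cosine(pred, tok) / temperature for tok in vocab]
    # Numerically stable log-sum-exp over the whole vocabulary.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    # Indices of the k nearest tokens (highest similarity first).
    positives = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)[:k]
    return -sum(logits[i] - log_z for i in positives) / k


# Toy stand-in for a token embedding table (real CLIP embeddings are far larger).
vocab = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A prediction that coincides with an existing token incurs almost no penalty.
loss_near = nearest_token_loss([1.0, 0.0, 0.0], vocab)

# A prediction equidistant from every token is penalized more heavily.
loss_far = nearest_token_loss([1.0, 1.0, 1.0], vocab)
```

In this toy setup, minimizing the loss moves a predicted embedding toward its nearest existing token, which is the behavior the abstract attributes to the regularization; how this term is weighted against the reconstruction objective is not stated in the abstract and is left open here.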

