
TextBoost: Towards One-Shot Personalization of Text-to-Image Models via Fine-tuning Text Encoder

September 12, 2024
作者: NaHyeon Park, Kunhee Kim, Hyunjung Shim
cs.AI

Abstract

Recent breakthroughs in text-to-image models have opened up promising research avenues in personalized image generation, enabling users to create diverse images of a specific subject using natural language prompts. However, existing methods often suffer from performance degradation when given only a single reference image. They tend to overfit the input, producing highly similar outputs regardless of the text prompt. This paper addresses the challenge of one-shot personalization by mitigating overfitting, enabling the creation of controllable images through text prompts. Specifically, we propose a selective fine-tuning strategy that focuses on the text encoder. Furthermore, we introduce three key techniques to enhance personalization performance: (1) augmentation tokens to encourage feature disentanglement and alleviate overfitting, (2) a knowledge-preservation loss to reduce language drift and promote generalizability across diverse prompts, and (3) SNR-weighted sampling for efficient training. Extensive experiments demonstrate that our approach efficiently generates high-quality, diverse images using only a single reference image while significantly reducing memory and storage requirements.
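Of the three techniques listed, SNR-weighted sampling is the most self-contained. The abstract does not specify the exact weighting function, so the following is only a minimal sketch of the general idea, assuming a standard DDPM-style linear beta schedule: timesteps are drawn with probability proportional to a clipped signal-to-noise ratio rather than uniformly, so training effort concentrates on the more informative noise levels. All function names and the clipping choice here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def make_snr_weights(num_steps=1000, beta_start=1e-4, beta_end=0.02, clip=5.0):
    """Build a sampling distribution over diffusion timesteps weighted by SNR.

    Assumes a linear beta schedule; SNR(t) = alpha_bar(t) / (1 - alpha_bar(t)).
    The SNR is clipped so low-noise (early) timesteps do not dominate.
    This weighting is a plausible instantiation, not TextBoost's exact scheme.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas_bar = np.cumprod(1.0 - betas)
    snr = alphas_bar / (1.0 - alphas_bar)
    weights = np.minimum(snr, clip)
    return weights / weights.sum()

def sample_timesteps(batch_size, probs, rng=None):
    """Draw a batch of timestep indices according to the SNR-based weights."""
    rng = rng if rng is not None else np.random.default_rng(0)
    return rng.choice(len(probs), size=batch_size, p=probs)

probs = make_snr_weights()
t = sample_timesteps(8, probs)
```

In a training loop, `t` would replace the usual uniformly sampled timesteps when adding noise and computing the denoising loss.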

