ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation
June 1, 2023
Authors: Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong
cs.AI
Abstract
Personalized text-to-image generation using diffusion models has recently
been proposed and attracted lots of attention. Given a handful of images
containing a novel concept (e.g., a unique toy), we aim to tune the generative
model to capture fine visual details of the novel concept and generate
photorealistic images following a text condition. We present a plug-in method,
named ViCo, for fast and lightweight personalized generation. Specifically, we
propose an image attention module to condition the diffusion process on the
patch-wise visual semantics. We introduce an attention-based object mask that
comes almost at no cost from the attention module. In addition, we design a
simple regularization based on the intrinsic properties of text-image attention
maps to alleviate the common overfitting degradation. Unlike many existing
models, our method does not finetune any parameters of the original diffusion
model. This allows more flexible and transferable model deployment. With only
light parameter training (~6% of the diffusion U-Net), our method achieves
comparable or even better performance than all state-of-the-art models both
qualitatively and quantitatively.
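The abstract describes an image attention module that conditions the diffusion process on patch-wise visual semantics of the reference image, plus an object mask obtained "almost at no cost" from the same attention maps. Below is a minimal PyTorch sketch of that idea; the class name, tensor shapes, head count, and the simple mean-threshold masking rule are illustrative assumptions and not the paper's exact design or hyperparameters.

```python
# Minimal sketch of an image cross-attention block with an attention-derived
# object mask. All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn


class ImageCrossAttention(nn.Module):
    """Cross-attend from noisy latent tokens to reference-image patch embeddings."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)  # queries from diffusion features
        self.to_k = nn.Linear(dim, dim, bias=False)  # keys from reference patches
        self.to_v = nn.Linear(dim, dim, bias=False)  # values from reference patches
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor):
        # x:   (B, N, D) noisy latent tokens at one U-Net resolution
        # ref: (B, M, D) patch embeddings of the reference image
        #      (e.g., from a frozen image encoder)
        B, N, D = x.shape
        h, d = self.heads, D // self.heads
        q = self.to_q(x).view(B, N, h, d).transpose(1, 2)      # (B, h, N, d)
        k = self.to_k(ref).view(B, -1, h, d).transpose(1, 2)   # (B, h, M, d)
        v = self.to_v(ref).view(B, -1, h, d).transpose(1, 2)   # (B, h, M, d)

        attn = (q @ k.transpose(-2, -1)) * self.scale           # (B, h, N, M)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)

        # Coarse object mask "for free": how much attention each reference patch
        # receives, averaged over heads and query tokens, then binarized. A mean
        # threshold stands in here for whatever rule the paper actually uses.
        patch_score = attn.mean(dim=(1, 2))                     # (B, M)
        mask = (patch_score > patch_score.mean(dim=-1, keepdim=True)).float()

        return self.to_out(out), mask


# Example usage (shapes only): 64x64 latent -> 4096 tokens, 16x16 patches -> 256 tokens.
x = torch.randn(2, 4096, 320)
ref = torch.randn(2, 256, 320)
block = ImageCrossAttention(dim=320, heads=8)
out, mask = block(x, ref)  # out: (2, 4096, 320), mask: (2, 256)
```

In a plug-in setup like the one the abstract describes, only blocks of this kind would be trained while the original U-Net and text encoder stay frozen, which is consistent with the reported ~6% trainable-parameter footprint.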