ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation
June 1, 2023
Authors: Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong
cs.AI
Abstract
Personalized text-to-image generation using diffusion models has recently
been proposed and attracted lots of attention. Given a handful of images
containing a novel concept (e.g., a unique toy), we aim to tune the generative
model to capture fine visual details of the novel concept and generate
photorealistic images following a text condition. We present a plug-in method,
named ViCo, for fast and lightweight personalized generation. Specifically, we
propose an image attention module to condition the diffusion process on the
patch-wise visual semantics. We introduce an attention-based object mask that
comes almost at no cost from the attention module. In addition, we design a
simple regularization based on the intrinsic properties of text-image attention
maps to alleviate the common overfitting degradation. Unlike many existing
models, our method does not finetune any parameters of the original diffusion
model. This allows more flexible and transferable model deployment. With only
light parameter training (~6% of the diffusion U-Net), our method achieves
comparable or even better performance than all state-of-the-art models both
qualitatively and quantitatively.
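
To make the two attention-related ideas in the abstract concrete, below is a minimal, self-contained PyTorch sketch (an illustration, not the authors' released code) of a cross-attention block that conditions U-Net features on patch-wise embeddings of a reference image and derives a rough object mask from its own attention weights. The module name, tensor shapes, and the mean-thresholding rule are illustrative assumptions.

```python
# Minimal sketch (assumption, not the authors' implementation) of an image
# cross-attention block that conditions U-Net features on patch-wise visual
# semantics and derives an object mask from its own attention weights.
import torch
import torch.nn as nn


class ImageCrossAttention(nn.Module):
    """Cross-attend noisy-latent features (queries) to patch embeddings of a
    reference image (keys/values), e.g. ViT/CLIP patch tokens."""

    def __init__(self, query_dim: int, context_dim: int,
                 heads: int = 8, dim_head: int = 64):
        super().__init__()
        inner = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(query_dim, inner, bias=False)
        self.to_k = nn.Linear(context_dim, inner, bias=False)
        self.to_v = nn.Linear(context_dim, inner, bias=False)
        self.to_out = nn.Linear(inner, query_dim)

    def forward(self, x: torch.Tensor, patches: torch.Tensor):
        # x:       (B, N_q, query_dim)   spatial features from the U-Net
        # patches: (B, N_p, context_dim) patch embeddings of the reference image
        B, N_q, _ = x.shape
        N_p = patches.shape[1]
        q = self.to_q(x).view(B, N_q, self.heads, -1).transpose(1, 2)
        k = self.to_k(patches).view(B, N_p, self.heads, -1).transpose(1, 2)
        v = self.to_v(patches).view(B, N_p, self.heads, -1).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale     # (B, H, N_q, N_p)
        attn = attn.softmax(dim=-1)

        # Attention-based object mask over reference patches: keep patches that
        # receive above-average attention, pooled over heads and query positions.
        # (The thresholding rule here is an illustrative choice.)
        patch_score = attn.mean(dim=(1, 2))               # (B, N_p)
        mask = (patch_score > patch_score.mean(dim=-1, keepdim=True)).float()

        # Suppress background patches and renormalize before aggregating values.
        attn = attn * mask[:, None, None, :]
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)

        out = (attn @ v).transpose(1, 2).reshape(B, N_q, -1)
        return self.to_out(out), mask


# Toy usage with dummy shapes: 64x64 latent locations, 256 reference patches.
block = ImageCrossAttention(query_dim=320, context_dim=768)
feats, ref = torch.randn(1, 64 * 64, 320), torch.randn(1, 256, 768)
out, obj_mask = block(feats, ref)   # out: (1, 4096, 320), obj_mask: (1, 256)
```

In a plug-in setup like the one the abstract describes, only lightweight blocks of this kind (plus the new concept's token embedding) would be trained, while the parameters of the original diffusion U-Net stay frozen, which is what enables flexible and transferable deployment.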