IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

August 13, 2023
Authors: Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang
cs.AI

Abstract

Large text-to-image diffusion models have recently demonstrated an impressive ability to generate high-fidelity images. However, generating a desired image from a text prompt alone is difficult and often requires complex prompt engineering. An image prompt is an alternative to a text prompt; as the saying goes, "an image is worth a thousand words". Existing methods that fine-tune pretrained models directly are effective, but they require large computing resources and are not compatible with other base models, text prompts, or structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pretrained text-to-image diffusion models. The key design of IP-Adapter is a decoupled cross-attention mechanism that uses separate cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters achieves performance comparable to, or even better than, a fully fine-tuned image prompt model. Because we freeze the pretrained diffusion model, IP-Adapter generalizes not only to other custom models fine-tuned from the same base model, but also to controllable generation with existing controllable tools. Thanks to the decoupled cross-attention strategy, the image prompt also works well together with a text prompt to achieve multimodal image generation. The project page is available at https://ip-adapter.github.io.
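
The decoupled cross-attention idea can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract only states that text features and image features get separate cross-attention layers; the sketch assumes the two branches share one query projection and that their outputs are summed, with a hypothetical `scale` weight on the image branch. All names (`decoupled_cross_attention`, `w_k_image`, and so on), the dimensions, and the token counts are made up for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention for a single head, no batch dimension.
    d = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d))
    return weights @ v


def decoupled_cross_attention(x, text_feats, image_feats, params, scale=1.0):
    # Assumption: both branches reuse the same query projection.
    q = x @ params["w_q"]
    # Text branch: stands in for the frozen, pretrained cross-attention.
    text_out = attention(
        q, text_feats @ params["w_k_text"], text_feats @ params["w_v_text"]
    )
    # Image branch: separate key/value projections (the adapter's new weights).
    image_out = attention(
        q, image_feats @ params["w_k_image"], image_feats @ params["w_v_image"]
    )
    # Assumption: the two outputs are combined by a weighted sum.
    return text_out + scale * image_out


# Toy inputs; every size here is illustrative only.
rng = np.random.default_rng(0)
dim = 8
names = ["w_q", "w_k_text", "w_v_text", "w_k_image", "w_v_image"]
params = {name: rng.normal(size=(dim, dim)) for name in names}

x = rng.normal(size=(16, dim))            # latent (query) tokens
text_feats = rng.normal(size=(77, dim))   # text prompt tokens
image_feats = rng.normal(size=(4, dim))   # image prompt tokens

out = decoupled_cross_attention(x, text_feats, image_feats, params)
```

Under these assumptions, setting `scale=0.0` reduces the layer to text-only cross-attention, which is one way to see why freezing the base model keeps its original text-to-image behaviour available.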