
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

August 13, 2023
Authors: Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang
cs.AI

Abstract

Recent years have witnessed the strong power of large text-to-image diffusion models, whose impressive generative capability can create high-fidelity images. However, generating a desired image from a text prompt alone is tricky, as it often involves complex prompt engineering. An alternative to the text prompt is the image prompt; as the saying goes, "an image is worth a thousand words". Although existing methods that directly fine-tune pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompts, or structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pretrained text-to-image diffusion models. The key design of IP-Adapter is a decoupled cross-attention mechanism that separates the cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve performance comparable to, or even better than, that of a fully fine-tuned image prompt model. Because we freeze the pretrained diffusion model, the proposed IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation with existing controllable tools. Thanks to the decoupled cross-attention strategy, the image prompt can also work well with the text prompt to achieve multimodal image generation. The project page is available at https://ip-adapter.github.io.
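
The decoupled cross-attention mentioned in the abstract can be illustrated with a small sketch. The Python below is not the authors' implementation. It assumes a single shared query projection, separate key/value projections for text and image features, and that the two attention outputs are added together with a hypothetical `scale` weight. All function and variable names are illustrative, and random matrices stand in for learned weights.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def decoupled_cross_attention(latents, text_feats, image_feats,
                              w_q, w_k_text, w_v_text, w_k_img, w_v_img,
                              scale=1.0):
    """Attend to text and image features in separate branches, then sum.

    Assumption: both branches share the query projection w_q, and the image
    branch is weighted by `scale` (a name chosen for this sketch).
    """
    q = matmul(latents, w_q)
    text_out = attention(q, matmul(text_feats, w_k_text), matmul(text_feats, w_v_text))
    img_out = attention(q, matmul(image_feats, w_k_img), matmul(image_feats, w_v_img))
    return [[t + scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, img_out)]


def random_matrix(rows, cols, rng):
    """Random stand-in for features or learned projection weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4
latents = random_matrix(3, d, rng)      # 3 query tokens from the denoising network
text_feats = random_matrix(5, d, rng)   # 5 text tokens
image_feats = random_matrix(2, d, rng)  # 2 image tokens
w_q, w_k_text, w_v_text, w_k_img, w_v_img = (random_matrix(d, d, rng) for _ in range(5))

out = decoupled_cross_attention(latents, text_feats, image_feats,
                                w_q, w_k_text, w_v_text, w_k_img, w_v_img, scale=1.0)
text_only = decoupled_cross_attention(latents, text_feats, image_feats,
                                      w_q, w_k_text, w_v_text, w_k_img, w_v_img, scale=0.0)
```

In this sketch, setting `scale=0.0` removes the image branch, so the output reduces to ordinary text cross-attention. This is one way to see why a frozen base model's text behavior can be left intact while only the image-side projections are trained.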