IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Abstract

Large text-to-image diffusion models have recently shown a strong ability to generate high-fidelity images. However, generating a desired image from a text prompt alone is difficult and often requires complex prompt engineering. An image prompt is an alternative to a text prompt; as the saying goes, "an image is worth a thousand words". Existing methods that directly fine-tune pretrained models are effective, but they require large computing resources and are not compatible with other base models, text prompts, or structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pretrained text-to-image diffusion models. The key design of IP-Adapter is a decoupled cross-attention mechanism that uses separate cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters can achieve performance comparable to or even better than a fully fine-tuned image prompt model. Because we freeze the pretrained diffusion model, IP-Adapter can be generalized not only to other custom models fine-tuned from the same base model, but also to controllable generation with existing controllable tools. Thanks to the decoupled cross-attention strategy, the image prompt can also work well together with a text prompt to achieve multimodal image generation. The project page is available at https://ip-adapter.github.io.
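The decoupled cross-attention idea can be illustrated with a small NumPy sketch. The abstract only states that text features and image features get separate cross-attention layers; how the two outputs are combined is an assumption here (they are summed, with a hypothetical `image_scale` weight on the image branch, and both branches share one query projection). All function names, weight names, and dimensions below are illustrative and are not taken from the authors' code.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention: q is (n_q, d), k and v are (n_kv, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def decoupled_cross_attention(x, text_feats, image_feats, w, image_scale=1.0):
    # Assumption: one shared query, separate key/value projections per
    # modality, and the two attention outputs are added together.
    q = x @ w["q"]
    # Text branch: stands in for the frozen, pretrained cross-attention.
    text_out = attention(q, text_feats @ w["k_text"], text_feats @ w["v_text"])
    # Image branch: stands in for the new adapter layers (hypothetical names).
    image_out = attention(q, image_feats @ w["k_image"], image_feats @ w["v_image"])
    return text_out + image_scale * image_out


# Toy dimensions chosen only for illustration.
rng = np.random.default_rng(0)
d_latent, d_cond, d_attn = 8, 6, 4
w = {
    "q": rng.normal(size=(d_latent, d_attn)),
    "k_text": rng.normal(size=(d_cond, d_attn)),
    "v_text": rng.normal(size=(d_cond, d_attn)),
    "k_image": rng.normal(size=(d_cond, d_attn)),
    "v_image": rng.normal(size=(d_cond, d_attn)),
}
x = rng.normal(size=(5, d_latent))            # 5 latent (query) tokens
text_feats = rng.normal(size=(7, d_cond))     # 7 text tokens
image_feats = rng.normal(size=(4, d_cond))    # 4 image tokens

out = decoupled_cross_attention(x, text_feats, image_feats, w)
text_only = decoupled_cross_attention(x, text_feats, image_feats, w, image_scale=0.0)
```

Under these assumptions, setting `image_scale=0.0` reduces the layer to ordinary text cross-attention, which is one way to see why the text prompt and the frozen base model stay usable when the image branch is added.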
