BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models

Abstract

Recent text-to-image generation models have shown remarkable success in generating images that faithfully follow input prompts. However, because a desired concept must be described in words, users have limited control over the appearance of the generated concepts. In this work, we address this shortcoming by proposing an approach that adds personalization capabilities to existing text-to-image diffusion models. We propose a novel architecture (BootPIG) that lets a user provide reference images of an object to guide the appearance of a concept in the generated images. The BootPIG architecture makes minimal modifications to a pretrained text-to-image diffusion model and uses a separate UNet model to steer the generations toward the desired appearance. We introduce a training procedure that bootstraps personalization capabilities in the BootPIG architecture using data generated from pretrained text-to-image models, LLM chat agents, and image segmentation models. In contrast to existing methods that require several days of pretraining, the BootPIG architecture can be trained in approximately 1 hour. Experiments on the DreamBooth dataset demonstrate that BootPIG outperforms existing zero-shot methods while being comparable to test-time finetuning approaches. Through a user study, we validate that BootPIG generations are preferred over those of existing methods, both in maintaining fidelity to the reference object's appearance and in aligning with textual prompts.
