Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Abstract

Text-to-image diffusion models produce impressive results but are frustrating tools for artists who want fine-grained control. A common use case is creating images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is not enough high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images, and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.
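The data-curation step described in the abstract (generate a grid of images, then keep only the pairs that a judge accepts) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: images are toy nested lists of grayscale values, `split_grid` and `curate_pairs` are hypothetical helper names, unordered pairs are used for simplicity, and `toy_judge` is a stand-in for the Visual-Language Model check that only compares mean brightness.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


def split_grid(grid: Image, rows: int, cols: int) -> List[Image]:
    """Cut a grid image into rows*cols equally sized panels (row-major order)."""
    height, width = len(grid), len(grid[0])
    if height % rows or width % cols:
        raise ValueError("grid size must be divisible by the panel layout")
    ph, pw = height // rows, width // cols
    panels = []
    for r in range(rows):
        for c in range(cols):
            band = grid[r * ph:(r + 1) * ph]
            panels.append([row[c * pw:(c + 1) * pw] for row in band])
    return panels


def curate_pairs(
    panels: List[Image],
    same_identity: Callable[[Image, Image], bool],
) -> List[Tuple[Image, Image]]:
    """Keep every unordered panel pair that the judge accepts."""
    return [(a, b) for a, b in combinations(panels, 2) if same_identity(a, b)]


def toy_judge(a: Image, b: Image) -> bool:
    """Hypothetical stand-in for the VLM check: compare mean brightness."""
    def mean(img: Image) -> float:
        return sum(map(sum, img)) / (len(img) * len(img[0]))
    return abs(mean(a) - mean(b)) <= 1.0


# A 4x4 "grid image" laid out as 2x2 panels.
grid = [
    [1, 1, 9, 9],
    [1, 1, 9, 9],
    [2, 2, 9, 9],
    [2, 2, 9, 9],
]
panels = split_grid(grid, rows=2, cols=2)
pairs = curate_pairs(panels, toy_judge)
```

In a real pipeline the judge would be a call to a Visual-Language Model and the accepted pairs would become (reference image, target image) training examples for fine-tuning; the callback signature keeps that choice separate from the grid handling.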
