Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding

Abstract

As large-scale text-to-image generation models have made remarkable progress, many fine-tuning methods have been proposed. However, these models often struggle with novel objects, especially in one-shot scenarios. Our proposed method aims to address the challenges of generalizability and fidelity in an object-driven way, using only a single input image and the object-specific regions of interest. To improve generalizability and mitigate overfitting, our paradigm initializes a prototypical embedding based on the object's appearance and its class before fine-tuning the diffusion model. During fine-tuning, we propose a class-characterizing regularization to preserve prior knowledge of object classes. To further improve fidelity, we introduce an object-specific loss, which can also be used to implant multiple objects. Overall, our proposed object-driven method for implanting new objects integrates seamlessly with existing concepts while achieving high fidelity and generalization. Our method outperforms several existing works. The code will be released.
