ViCo: Detail-Preserving Visual Condition for Personalized Text-to-Image Generation

Abstract

Personalized text-to-image generation using diffusion models has recently emerged and attracted considerable attention. Given a handful of images containing a novel concept (e.g., a unique toy), we aim to tune the generative model to capture fine visual details of the novel concept and generate photorealistic images following a text condition. We present ViCo, a plug-in method for fast and lightweight personalized generation. Specifically, we propose an image attention module to condition the diffusion process on patch-wise visual semantics. We introduce an attention-based object mask that comes at almost no cost from the attention module. In addition, we design a simple regularization based on the intrinsic properties of text-image attention maps to alleviate the common overfitting degradation. Unlike many existing models, our method does not finetune any parameters of the original diffusion model, which allows more flexible and transferable model deployment. With only light parameter training (~6% of the diffusion U-Net), our method achieves performance comparable to or better than all state-of-the-art models, both qualitatively and quantitatively.
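
To make the idea of conditioning on patch-wise visual semantics more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between patches of the image being generated and patches of a reference image, together with a binary mask read off the attention weights. It is not the authors' implementation: the function names `cross_attention` and `attention_mask`, the toy feature vectors, and the mean-threshold rule for the mask are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    queries: patch features of the image being generated (list of vectors)
    keys, values: patch features of the reference image (lists of vectors)
    Returns (outputs, weights); weights[i][j] is how much query i attends to key j.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]
    dim = len(values[0])
    outputs = [
        [sum(w * v[d] for w, v in zip(w_row, values)) for d in range(dim)]
        for w_row in weights
    ]
    return outputs, weights


def attention_mask(weights):
    """Hypothetical mask rule (an assumption, not taken from the paper):
    keep reference patches whose average received attention is above the mean."""
    n_keys = len(weights[0])
    received = [sum(row[j] for row in weights) / len(weights) for j in range(n_keys)]
    threshold = sum(received) / n_keys
    return [1 if r > threshold else 0 for r in received]


# Toy example: two generated-image patches, three reference-image patches.
queries = [[1.0, 0.0], [0.0, 1.0]]
reference = [[2.0, 0.0], [0.0, 2.0], [-2.0, -2.0]]

outputs, weights = cross_attention(queries, reference, reference)
mask = attention_mask(weights)
print(mask)  # the third reference patch receives little attention and is masked out
```

Because the mask is computed from attention weights that the module already produces, it adds only a thresholding step, which is consistent with the abstract's description of a mask that comes at almost no cost.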
