DreamTuner: Single Image is Enough for Subject-Driven Generation

Abstract

Diffusion-based models have demonstrated impressive capabilities for text-to-image generation and hold promise for personalized applications of subject-driven generation, which require generating customized concepts from one or a few reference images. However, existing fine-tuning-based methods fail to balance the trade-off between learning the subject and preserving the generation capabilities of the pretrained model. Moreover, methods that rely on additional image encoders tend to lose important details of the subject because of encoding compression. To address these challenges, we propose DreamTuner, a novel method that injects reference information from coarse to fine to achieve subject-driven image generation more effectively. DreamTuner introduces a subject-encoder for coarse subject identity preservation, in which the compressed general subject features are introduced through an attention layer placed before the visual-text cross-attention. We then modify the self-attention layers within pretrained text-to-image models into self-subject-attention layers to refine the details of the target subject. In self-subject-attention, the generated image queries detailed features from both the reference image and itself. It is worth emphasizing that self-subject-attention is an effective, elegant, and training-free method for maintaining the detailed features of customized subjects and can serve as a plug-and-play solution during inference. Finally, with additional subject-driven fine-tuning, DreamTuner achieves remarkable performance in subject-driven image generation, which can be controlled by text or other conditions such as pose. For further details, please visit the project page at https://dreamtuner-diffusion.github.io/.
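The self-subject-attention idea described in the abstract can be illustrated with a minimal sketch. The abstract gives no equations, so everything below is an assumption made for illustration: a single attention head, NumPy arrays standing in for feature maps, and the hypothetical names `self_subject_attention`, `gen_feats`, and `ref_feats`. The only property taken from the text is that queries come from the generated image, while keys and values come from both the generated image and the reference image, with no new trainable weights.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_subject_attention(gen_feats, ref_feats, w_q, w_k, w_v):
    """Hypothetical single-head sketch, not the authors' implementation.

    Queries are computed from the generated image only; keys and values
    are computed from the generated image AND the reference image, using
    the same (pretrained) projection matrices, so nothing is trained.
    """
    q = gen_feats @ w_q                                   # (n_gen, d)
    kv_source = np.concatenate([gen_feats, ref_feats], axis=0)
    k = kv_source @ w_k                                   # (n_gen + n_ref, d)
    v = kv_source @ w_v                                   # (n_gen + n_ref, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])               # scaled dot product
    weights = softmax(scores, axis=-1)                    # rows sum to 1
    return weights @ v, weights


# Toy data: 4 generated-image tokens, 6 reference-image tokens, width 8.
rng = np.random.default_rng(0)
d = 8
gen = rng.normal(size=(4, d))
ref = rng.normal(size=(6, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = self_subject_attention(gen, ref, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (4, 8) (4, 10)
```

Because the projection matrices are reused unchanged, dropping the reference tokens from `kv_source` recovers ordinary self-attention, which is one way to read the abstract's claim that the layer works as a plug-and-play component at inference time.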
