λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space

Abstract

Despite recent advances in personalized text-to-image (P-T2I) generative models, subject-driven T2I remains challenging. The primary bottlenecks are 1) intensive training resource requirements, 2) hyperparameter sensitivity that leads to inconsistent outputs, and 3) the difficulty of balancing the intricacies of novel visual concept alignment and composition alignment. To address these limitations, we start by revisiting the core philosophy of T2I diffusion models. Contemporary subject-driven T2I approaches predominantly hinge on Latent Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention layers. While LDMs offer distinct advantages, the reliance of P-T2I methods on the latent space of these diffusion models significantly escalates resource demands, leads to inconsistent results, and necessitates numerous iterations to obtain a single desired image. Recently, ECLIPSE demonstrated a more resource-efficient pathway for training UnCLIP-based T2I models, circumventing the need for diffusion text-to-image priors. Building on this, we introduce λ-ECLIPSE. Our method illustrates that effective P-T2I does not necessarily depend on the latent space of diffusion models. λ-ECLIPSE achieves single-subject, multi-subject, and edge-guided T2I personalization with just 34M parameters and is trained in a mere 74 GPU hours using 1.6M image-text interleaved data. Through extensive experiments, we also establish that λ-ECLIPSE surpasses existing baselines in composition alignment while preserving concept alignment performance, even with significantly lower resource utilization.

