λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
February 7, 2024
Authors: Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang
cs.AI
Abstract
Despite the recent advances in personalized text-to-image (P-T2I) generative
models, subject-driven T2I remains challenging. The primary bottlenecks include:
1) intensive training resource requirements, 2) hyper-parameter sensitivity
leading to inconsistent outputs, and 3) the difficulty of balancing fidelity to
novel visual concepts against composition alignment. We begin by revisiting the
core philosophy of T2I diffusion models to address the above limitations.
Predominantly, contemporary subject-driven T2I approaches hinge on Latent
Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention
layers. While LDMs offer distinct advantages, P-T2I methods' reliance on the
latent space of these diffusion models significantly escalates resource
demands, leading to inconsistent results and necessitating numerous iterations
for a single desired image. Recently, ECLIPSE has demonstrated a more
resource-efficient pathway for training UnCLIP-based T2I models, circumventing
the need for diffusion text-to-image priors. Building on this, we introduce
λ-ECLIPSE. Our method illustrates that effective P-T2I does not necessarily
depend on the latent space of diffusion models. λ-ECLIPSE achieves
single-subject, multi-subject, and edge-guided T2I personalization with just
34M parameters and is trained in a mere 74 GPU hours on 1.6M interleaved
image-text examples. Through extensive experiments, we also establish that
λ-ECLIPSE surpasses existing baselines in composition alignment while
preserving concept alignment performance, even with significantly lower
resource utilization.
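
To make the abstract's central claim concrete: the prior that maps text (and subject) embeddings into the CLIP image-embedding space need not be a diffusion model. The sketch below is a minimal, hypothetical illustration of such a non-diffusion prior. All names and dimensions (NonDiffusionPrior, dim=768) are illustrative assumptions, not the paper's released architecture or objective.

```python
# Hypothetical sketch: a small feed-forward "prior" maps CLIP text embeddings,
# interleaved with the subject's CLIP image embedding(s), directly to the CLIP
# image-embedding space consumed by an unCLIP-style decoder -- no denoising loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonDiffusionPrior(nn.Module):
    """Maps interleaved text/subject CLIP embeddings to one CLIP image embedding."""
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, interleaved: torch.Tensor) -> torch.Tensor:
        # interleaved: (batch, seq_len, dim) -- text token embeddings with the
        # subject's CLIP image embedding(s) spliced in at the concept positions.
        h = self.encoder(interleaved)
        # Pool over the sequence and project to the predicted image embedding.
        return self.out(h.mean(dim=1))

# Training-objective sketch: regress the ground-truth CLIP image embedding of
# the target picture (here via a cosine loss on random placeholder tensors).
prior = NonDiffusionPrior()
pred = prior(torch.randn(2, 77, 768))   # placeholder interleaved inputs
target = torch.randn(2, 768)            # placeholder CLIP image embeddings
loss = 1 - F.cosine_similarity(pred, target).mean()
loss.backward()
```

The relevant design point, as the abstract frames it, is that the prediction target is a single CLIP image embedding rather than a denoising trajectory, which is what makes a small parameter count and GPU-hour budget plausible.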