λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
February 7, 2024
Authors: Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang
cs.AI
Abstract
Despite the recent advances in personalized text-to-image (P-T2I) generative
models, subject-driven T2I remains challenging. The primary bottlenecks include:
1) intensive training resource requirements, 2) hyper-parameter sensitivity
leading to inconsistent outputs, and 3) the difficulty of balancing fidelity to
novel visual concepts against composition alignment. We begin by revisiting the
core philosophy of T2I diffusion models to address the above limitations.
Predominantly, contemporary subject-driven T2I approaches hinge on Latent
Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention
layers. While LDMs offer distinct advantages, P-T2I methods' reliance on the
latent space of these diffusion models significantly escalates resource
demands, leading to inconsistent results and necessitating numerous iterations
for a single desired image. Recently, ECLIPSE has demonstrated a more
resource-efficient pathway for training UnCLIP-based T2I models, circumventing
the need for diffusion text-to-image priors. Building on this, we introduce
λ-ECLIPSE. Our method illustrates that effective P-T2I does not necessarily
depend on the latent space of diffusion models. λ-ECLIPSE achieves
single-subject, multi-subject, and edge-guided T2I personalization with just
34M parameters and is trained in a mere 74 GPU hours on 1.6M interleaved
image-text examples. Through extensive experiments, we also establish that
λ-ECLIPSE surpasses existing baselines in composition alignment while
preserving concept alignment performance, even with significantly lower
resource utilization.
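
To make the abstract's central claim concrete: the prior that maps text (and subject) embeddings into the CLIP image-embedding space need not be a diffusion model. The sketch below is a minimal, hypothetical illustration of such a non-diffusion prior. All names and dimensions (NonDiffusionPrior, dim=768) are illustrative assumptions, not the paper's released architecture or objective.

```python
# Hypothetical sketch: a small feed-forward "prior" maps CLIP text embeddings,
# interleaved with the subject's CLIP image embedding(s), directly to the CLIP
# image-embedding space consumed by an unCLIP-style decoder -- no denoising loop.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonDiffusionPrior(nn.Module):
    """Maps interleaved text/subject CLIP embeddings to one CLIP image embedding."""
    def __init__(self, dim: int = 768, depth: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)

    def forward(self, interleaved: torch.Tensor) -> torch.Tensor:
        # interleaved: (batch, seq_len, dim) -- text token embeddings with the
        # subject's CLIP image embedding(s) spliced in at the concept positions.
        h = self.encoder(interleaved)
        # Pool over the sequence and project to the predicted image embedding.
        return self.out(h.mean(dim=1))

# Training-objective sketch: regress the ground-truth CLIP image embedding of
# the target picture (here via a cosine loss on random placeholder tensors).
prior = NonDiffusionPrior()
pred = prior(torch.randn(2, 77, 768))   # placeholder interleaved inputs
target = torch.randn(2, 768)            # placeholder CLIP image embeddings
loss = 1 - F.cosine_similarity(pred, target).mean()
loss.backward()
```

The relevant design point, as the abstract frames it, is that the prediction target is a single CLIP image embedding rather than a denoising trajectory, which is what makes a small parameter count and GPU-hour budget plausible.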