λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
February 7, 2024
Authors: Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang
cs.AI
Abstract
Despite the recent advances in personalized text-to-image (P-T2I) generative
models, subject-driven T2I remains challenging. The primary bottlenecks include
1) intensive training resource requirements, 2) hyper-parameter sensitivity
leading to inconsistent outputs, and 3) the difficulty of balancing novel
visual concepts against composition alignment. We start by re-iterating the core
philosophy of T2I diffusion models to address the above limitations.
Predominantly, contemporary subject-driven T2I approaches hinge on Latent
Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention
layers. While LDMs offer distinct advantages, P-T2I methods' reliance on the
latent space of these diffusion models significantly escalates resource
demands, leading to inconsistent results and necessitating numerous iterations
for a single desired image. Recently, ECLIPSE has demonstrated a more
resource-efficient pathway for training UnCLIP-based T2I models, circumventing
the need for diffusion text-to-image priors. Building on this, we introduce
λ-ECLIPSE. Our method illustrates that effective P-T2I does not
necessarily depend on the latent space of diffusion models. λ-ECLIPSE
achieves single-subject, multi-subject, and edge-guided T2I personalization with just
34M parameters and is trained in a mere 74 GPU hours on 1.6M interleaved
image-text pairs. Through extensive experiments, we also establish that
λ-ECLIPSE surpasses existing baselines in composition alignment while
preserving concept alignment performance, even with significantly lower
resource utilization.
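
For intuition only, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of the core idea the abstract describes: a small non-diffusion prior that maps interleaved prompt and subject CLIP embeddings directly to a single CLIP image embedding, which a frozen unCLIP decoder would then render into pixels. All module names, dimensions, and the learned readout query are illustrative assumptions.

import torch
import torch.nn as nn

class PriorTransformer(nn.Module):
    """Hypothetical non-diffusion prior: maps interleaved text/subject CLIP
    embeddings to one predicted CLIP image embedding in a single forward pass."""

    def __init__(self, clip_dim: int = 768, depth: int = 6, heads: int = 8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=clip_dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Learned query token; its output is read out as the predicted image embedding.
        self.query = nn.Parameter(torch.randn(1, 1, clip_dim))
        self.proj = nn.Linear(clip_dim, clip_dim)

    def forward(self, text_tokens: torch.Tensor, subject_embeds: torch.Tensor) -> torch.Tensor:
        # text_tokens:    (B, T, D) CLIP text token embeddings of the prompt
        # subject_embeds: (B, S, D) CLIP image embeddings of the reference subject(s)
        b = text_tokens.size(0)
        seq = torch.cat([self.query.expand(b, -1, -1), text_tokens, subject_embeds], dim=1)
        out = self.encoder(seq)
        return self.proj(out[:, 0])  # predicted CLIP image embedding, shape (B, D)

if __name__ == "__main__":
    prior = PriorTransformer()
    text = torch.randn(2, 77, 768)     # dummy prompt token embeddings
    subjects = torch.randn(2, 2, 768)  # dummy embeddings for two reference subjects
    z_img = prior(text, subjects)
    print(z_img.shape)  # torch.Size([2, 768]); would be fed to a frozen unCLIP decoder

The point of the sketch is the contrast with LDM-based personalization: instead of fine-tuning or conditioning a diffusion model's latent space, a compact mapper is trained entirely within the CLIP embedding space and the pretrained decoder stays untouched.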