λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
February 7, 2024
Authors: Maitreya Patel, Sangmin Jung, Chitta Baral, Yezhou Yang
cs.AI
Abstract
Despite the recent advances in personalized text-to-image (P-T2I) generative
models, subject-driven T2I remains challenging. The primary bottlenecks include
1) intensive training resource requirements, 2) hyper-parameter sensitivity
leading to inconsistent outputs, and 3) the difficulty of balancing novel
visual concepts against composition alignment. We start by re-iterating the core
philosophy of T2I diffusion models to address the above limitations.
Predominantly, contemporary subject-driven T2I approaches hinge on Latent
Diffusion Models (LDMs), which facilitate T2I mapping through cross-attention
layers. While LDMs offer distinct advantages, P-T2I methods' reliance on the
latent space of these diffusion models significantly escalates resource
demands, leading to inconsistent results and necessitating numerous iterations
for a single desired image. Recently, ECLIPSE has demonstrated a more
resource-efficient pathway for training UnCLIP-based T2I models, circumventing
the need for diffusion text-to-image priors. Building on this, we introduce
λ-ECLIPSE. Our method illustrates that effective P-T2I does not
necessarily depend on the latent space of diffusion models. λ-ECLIPSE
achieves single-subject, multi-subject, and edge-guided T2I personalization with just
34M parameters and is trained in a mere 74 GPU hours on 1.6M interleaved
image-text pairs. Through extensive experiments, we also establish that
λ-ECLIPSE surpasses existing baselines in composition alignment while
preserving concept alignment performance, even with significantly lower
resource utilization.
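
For intuition only, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of the core idea the abstract describes: a small non-diffusion prior that maps interleaved prompt and subject CLIP embeddings directly to a single CLIP image embedding, which a frozen unCLIP decoder would then render into pixels. All module names, dimensions, and the learned readout query are illustrative assumptions.

import torch
import torch.nn as nn

class PriorTransformer(nn.Module):
    """Hypothetical non-diffusion prior: maps interleaved text/subject CLIP
    embeddings to one predicted CLIP image embedding in a single forward pass."""

    def __init__(self, clip_dim: int = 768, depth: int = 6, heads: int = 8):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=clip_dim, nhead=heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Learned query token; its output is read out as the predicted image embedding.
        self.query = nn.Parameter(torch.randn(1, 1, clip_dim))
        self.proj = nn.Linear(clip_dim, clip_dim)

    def forward(self, text_tokens: torch.Tensor, subject_embeds: torch.Tensor) -> torch.Tensor:
        # text_tokens:    (B, T, D) CLIP text token embeddings of the prompt
        # subject_embeds: (B, S, D) CLIP image embeddings of the reference subject(s)
        b = text_tokens.size(0)
        seq = torch.cat([self.query.expand(b, -1, -1), text_tokens, subject_embeds], dim=1)
        out = self.encoder(seq)
        return self.proj(out[:, 0])  # predicted CLIP image embedding, shape (B, D)

if __name__ == "__main__":
    prior = PriorTransformer()
    text = torch.randn(2, 77, 768)     # dummy prompt token embeddings
    subjects = torch.randn(2, 2, 768)  # dummy embeddings for two reference subjects
    z_img = prior(text, subjects)
    print(z_img.shape)  # torch.Size([2, 768]); would be fed to a frozen unCLIP decoder

The point of the sketch is the contrast with LDM-based personalization: instead of fine-tuning or conditioning a diffusion model's latent space, a compact mapper is trained entirely within the CLIP embedding space and the pretrained decoder stays untouched.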