ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
March 8, 2024
Authors: Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu
cs.AI
Abstract
Diffusion models have demonstrated remarkable performance in the domain of
text-to-image generation. However, most widely used models still employ CLIP as
their text encoder, which constrains their ability to comprehend dense prompts,
encompassing multiple objects, detailed attributes, complex relationships,
long-text alignment, etc. In this paper, we introduce an Efficient Large
Language Model Adapter, termed ELLA, which equips text-to-image diffusion
models with powerful Large Language Models (LLMs) to enhance text alignment
without training either the U-Net or the LLM. To seamlessly bridge two pre-trained
models, we investigate a range of semantic alignment connector designs and
propose a novel module, the Timestep-Aware Semantic Connector (TSC), which
dynamically extracts timestep-dependent conditions from the LLM. Our approach
adapts semantic features at different stages of the denoising process,
assisting diffusion models in interpreting lengthy and intricate prompts over
sampling timesteps. Additionally, ELLA can be readily integrated with
community models and tools to improve their prompt-following capabilities. To
assess text-to-image models on dense prompt following, we introduce the Dense
Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K
dense prompts. Extensive experiments demonstrate the superiority of ELLA in
dense prompt following compared to state-of-the-art methods, particularly in
multiple object compositions involving diverse attributes and relationships.
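
The idea of extracting timestep-dependent conditions from frozen LLM features can be illustrated with a toy sketch. The abstract does not specify the TSC architecture, so everything below is an assumption made for illustration: the class name ToyTimestepAwareConnector, the fixed set of query vectors, the sinusoidal timestep embedding, and the choice to add that embedding to each query before cross-attending over the token features. The queries are random and untrained, and the "LLM features" are random placeholders.

```python
import math
import random


def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (dim is assumed even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class ToyTimestepAwareConnector:
    """Hypothetical toy connector, not the paper's TSC.

    A fixed number of query vectors is shifted by a timestep embedding,
    then each query cross-attends over frozen text-encoder token features.
    """

    def __init__(self, dim, num_queries, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Stand-in for learnable queries: random and untrained in this sketch.
        self.queries = [[rng.gauss(0, 1) for _ in range(dim)]
                        for _ in range(num_queries)]

    def __call__(self, llm_features, t):
        emb = timestep_embedding(t, self.dim)
        conditions = []
        for q in self.queries:
            # Timestep-conditioned query (additive shift is an assumption).
            q_t = [qi + ei for qi, ei in zip(q, emb)]
            scores = [dot(q_t, k) / math.sqrt(self.dim) for k in llm_features]
            weights = softmax(scores)
            # Attention output: weighted average of the token features.
            conditions.append([
                sum(w * f[d] for w, f in zip(weights, llm_features))
                for d in range(self.dim)
            ])
        return conditions


# Placeholder "LLM features": 20 tokens, each an 8-dimensional vector.
rng = random.Random(1)
llm_features = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(20)]

connector = ToyTimestepAwareConnector(dim=8, num_queries=4)
cond_early = connector(llm_features, t=900)  # noisy, early denoising step
cond_late = connector(llm_features, t=50)    # late denoising step
```

The same prompt features yield a different fixed-length condition at each timestep, which is the behavior the abstract attributes to the TSC. In a real system the connector would be trained while the text encoder and U-Net stay frozen; none of that training is shown here.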