ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Abstract

Diffusion models have demonstrated remarkable performance in text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts involving multiple objects, detailed attributes, complex relationships, and long-text alignment. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLMs) to enhance text alignment without training either the U-Net or the LLM. To seamlessly bridge the two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from the LLM. Our approach adapts semantic features at different stages of the denoising process, helping diffusion models interpret lengthy and intricate prompts over the sampling timesteps. Additionally, ELLA can be readily combined with community models and tools to improve their prompt-following capabilities. To assess text-to-image models on dense prompt following, we introduce the Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate that ELLA outperforms state-of-the-art methods in dense prompt following, particularly for compositions of multiple objects involving diverse attributes and relationships.
