ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Abstract

Diffusion models have demonstrated remarkable performance in text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts involving multiple objects, detailed attributes, complex relationships, and long-text alignment. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLMs) to enhance text alignment without training either the U-Net or the LLM. To seamlessly bridge the two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from the LLM. Our approach adapts semantic features at different stages of the denoising process, helping diffusion models interpret lengthy and intricate prompts across sampling timesteps. Additionally, ELLA can be readily integrated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models on dense prompt following, we introduce the Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple-object compositions involving diverse attributes and relationships.
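To make the idea of "timestep-dependent conditions" more concrete, the following is a minimal toy sketch in NumPy. The abstract does not specify the internals of the TSC, so everything here is an assumption made for illustration: the class name `ToyTimestepAwareConnector`, the use of a fixed set of query vectors, the scale-and-shift modulation by a sinusoidal timestep embedding, the single-head cross-attention, the dimensions, and the random (untrained) weights. It only shows how a connector sitting between a frozen LLM and a frozen U-Net could produce different condition tokens for the same prompt at different denoising steps.

```python
import numpy as np

def timestep_embedding(t: int, dim: int) -> np.ndarray:
    """Sinusoidal embedding of a scalar timestep (dim must be even)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    args = t * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalize the last axis to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ToyTimestepAwareConnector:
    """Hypothetical connector: NOT the authors' TSC, only an illustration.

    Maps variable-length LLM token features to a fixed number of
    condition tokens, with the queries modulated by the timestep.
    """

    def __init__(self, llm_dim: int, cond_dim: int,
                 num_queries: int = 8, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.cond_dim = cond_dim
        # Random weights stand in for parameters that would be trained.
        self.queries = rng.normal(0.0, 1.0, (num_queries, cond_dim))
        self.w_mod = rng.normal(0.0, cond_dim ** -0.5, (cond_dim, 2 * cond_dim))
        self.w_q = rng.normal(0.0, cond_dim ** -0.5, (cond_dim, cond_dim))
        self.w_k = rng.normal(0.0, llm_dim ** -0.5, (llm_dim, cond_dim))
        self.w_v = rng.normal(0.0, llm_dim ** -0.5, (llm_dim, cond_dim))

    def __call__(self, llm_feats: np.ndarray, t: int) -> np.ndarray:
        # The timestep embedding yields a per-channel scale and shift.
        t_emb = timestep_embedding(t, self.cond_dim)
        scale, shift = np.split(t_emb @ self.w_mod, 2)
        # Modulate the normalized queries by the timestep.
        q = (layer_norm(self.queries) * (1.0 + scale) + shift) @ self.w_q
        # Keys and values come from the (frozen) LLM features.
        k = llm_feats @ self.w_k
        v = llm_feats @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(self.cond_dim), axis=-1)
        return attn @ v  # shape: (num_queries, cond_dim)

# Usage: the same prompt features give different conditions at different steps.
rng = np.random.default_rng(1)
llm_feats = rng.normal(size=(20, 32))   # 20 prompt tokens, hypothetical LLM width 32
tsc = ToyTimestepAwareConnector(llm_dim=32, cond_dim=16)
early = tsc(llm_feats, t=900)           # early, high-noise denoising step
late = tsc(llm_feats, t=100)            # late, low-noise denoising step
print(early.shape, late.shape)
print("conditions differ across timesteps:", not np.allclose(early, late))
```

In this sketch the U-Net would consume the returned tokens through its existing cross-attention layers in place of CLIP text embeddings, and only the connector's weights would be trained. That matches the abstract's statement that neither the U-Net nor the LLM is trained, but the exact wiring is again an assumption.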
