ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Abstract

Diffusion models have demonstrated remarkable performance in text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which limits their ability to comprehend dense prompts involving multiple objects, detailed attributes, complex relationships, and long-text alignment. In this paper, we introduce the Efficient Large Language Model Adapter (ELLA), which equips text-to-image diffusion models with powerful Large Language Models (LLMs) to enhance text alignment without training either the U-Net or the LLM. To seamlessly bridge the two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from the LLM. Our approach adapts semantic features at different stages of the denoising process, helping diffusion models interpret lengthy and intricate prompts across sampling timesteps. Additionally, ELLA can be readily combined with community models and tools to improve their prompt-following capabilities. To assess text-to-image models on dense prompt following, we introduce the Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multi-object compositions involving diverse attributes and relationships.
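The abstract does not describe the internals of the TSC, so the following is only a toy illustration of what "timestep-dependent conditions" could mean, not the authors' implementation. It assumes a fixed set of query vectors that are shifted by a sinusoidal timestep embedding and then cross-attend over frozen LLM token features, so that the same prompt yields different condition vectors at different denoising steps. All names (`timestep_embedding`, `connect`, `queries`) are hypothetical, and the feature dimension is assumed to be even.

```python
import math

def timestep_embedding(t, dim):
    """Sinusoidal embedding of a scalar timestep (dim is assumed even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def connect(llm_features, queries, t):
    """Map variable-length LLM token features to len(queries) condition vectors.

    Assumption for illustration: each query is shifted by the timestep
    embedding before single-head cross-attention, so the attention pattern
    (and hence the output) depends on t.
    """
    dim = len(queries[0])
    t_emb = timestep_embedding(t, dim)
    conditions = []
    for q in queries:
        q_t = [qi + ti for qi, ti in zip(q, t_emb)]  # timestep-aware query
        scores = [dot(q_t, k) / math.sqrt(dim) for k in llm_features]
        weights = softmax(scores)
        # Weighted average of the (frozen) LLM token features.
        conditions.append(
            [sum(w * f[j] for w, f in zip(weights, llm_features)) for j in range(dim)]
        )
    return conditions

# Three LLM token features and two queries, all of dimension 4.
llm_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
queries = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
early_step = connect(llm_features, queries, t=900)  # noisy, early in sampling
late_step = connect(llm_features, queries, t=50)    # nearly denoised
```

In this sketch the LLM features are never modified; only the small connector sees the timestep. This mirrors the abstract's claim that neither the U-Net nor the LLM is trained, although a real connector would use learned projections rather than fixed vectors.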
