LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation

Abstract

In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit limitations in multilingual processing, which hinders image generation across diverse languages. We address these challenges by leveraging the advanced capabilities of LLMs. Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. We then incorporate a lightweight adapter and a cross-modal refiner to enable efficient feature alignment and interaction between LLM features and image features. LDGen reduces training time and enables zero-shot multilingual image generation. Experimental results indicate that our method surpasses baseline models in both prompt adherence and image aesthetic quality, while seamlessly supporting multiple languages. Project page: https://zrealli.github.io/LDGen.
