LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation
February 25, 2025
Authors: Pengzhi Li, Pengfei Yu, Zide Liu, Wei He, Xuhao Pan, Xudong Rao, Tao Wei, Wei Chen
cs.AI
Abstract
In this paper, we introduce LDGen, a novel method for integrating large
language models (LLMs) into existing text-to-image diffusion models while
minimizing computational demands. Traditional text encoders, such as CLIP and
T5, exhibit limitations in multilingual processing, hindering image generation
across diverse languages. We address these challenges by leveraging the
advanced capabilities of LLMs. Our approach employs a language representation
strategy that applies hierarchical caption optimization and human instruction
techniques to derive precise semantic information. Subsequently, we
incorporate a lightweight adapter and a cross-modal refiner to facilitate
efficient feature alignment and interaction between LLMs and image features.
LDGen reduces training time and enables zero-shot multilingual image
generation. Experimental results indicate that our method surpasses baseline
models in both prompt adherence and image aesthetic quality, while seamlessly
supporting multiple languages. Project page: https://zrealli.github.io/LDGen.
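The abstract does not specify the architecture of the lightweight adapter or the cross-modal refiner. As a rough illustration only, the sketch below assumes the adapter is a simple linear projection that maps high-dimensional LLM token features into the diffusion model's text-feature space, and the refiner is a single cross-attention step in which image tokens attend to the adapted text tokens. All names, dimensions, and design choices here are hypothetical, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class Adapter:
    """Hypothetical lightweight adapter: a linear map from the LLM's
    feature dimension to the diffusion model's text-feature dimension."""
    def __init__(self, d_llm, d_model, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.02, size=(d_llm, d_model))
        self.b = np.zeros(d_model)

    def __call__(self, h):
        # h: (num_text_tokens, d_llm) -> (num_text_tokens, d_model)
        return h @ self.W + self.b

def cross_modal_refine(img_tokens, txt_tokens):
    """Hypothetical cross-modal refiner: single-head cross-attention
    where image tokens (queries) attend to adapted text tokens."""
    d_k = txt_tokens.shape[-1]
    scores = img_tokens @ txt_tokens.T / np.sqrt(d_k)  # (n_img, n_txt)
    return softmax(scores, axis=-1) @ txt_tokens       # (n_img, d_model)

# Toy shapes for illustration (real LLM features would be far larger).
adapter = Adapter(d_llm=64, d_model=32)
txt = adapter(np.random.default_rng(1).normal(size=(7, 64)))   # 7 text tokens
img = np.random.default_rng(2).normal(size=(16, 32))           # 16 image tokens
refined = cross_modal_refine(img, txt)
print(refined.shape)  # (16, 32)
```

The point of the sketch is the data flow the abstract describes: LLM features pass through a small trainable adapter so they align with the image branch, after which a cross-attention refiner lets image features interact with them, avoiding retraining the full text encoder.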