LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation

Abstract

In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit limitations in multilingual processing, hindering image generation across diverse languages. We address these challenges by leveraging the advanced capabilities of LLMs. Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. Subsequently, we incorporate a lightweight adapter and a cross-modal refiner to facilitate efficient feature alignment and interaction between LLMs and image features. LDGen reduces training time and enables zero-shot multilingual image generation. Experimental results indicate that our method surpasses baseline models in both prompt adherence and image aesthetic quality, while seamlessly supporting multiple languages. Project page: https://zrealli.github.io/LDGen.
