Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
June 17, 2024
Authors: Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu
cs.AI
Abstract
Large language models (LLMs) based on decoder-only transformers have
demonstrated superior text understanding capabilities compared to CLIP and
T5-series models. However, the paradigm for utilizing current advanced LLMs in
text-to-image diffusion models remains to be explored. We observed an unusual
phenomenon: directly using a large language model as the prompt encoder
significantly degrades the prompt-following ability in image generation. We
identified two main obstacles behind this issue. One is the misalignment
between the next token prediction training in LLM and the requirement for
discriminative prompt features in diffusion models. The other is the intrinsic
positional bias introduced by the decoder-only architecture. To address these
obstacles, we propose a novel framework that fully harnesses the capabilities of
LLMs. Through carefully designed usage guidance, we effectively enhance the text
representation capability for prompt encoding and eliminate its inherent
positional bias. This allows us to integrate state-of-the-art LLMs into the
text-to-image generation model flexibly. Furthermore, we provide an effective
way to fuse multiple LLMs within our framework. Considering the
excellent performance and scaling capabilities demonstrated by the transformer
architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT)
based on the framework. We conduct extensive experiments to validate LI-DiT
across model sizes and data sizes. Benefiting from the inherent abilities of the
LLMs and our innovative designs, the prompt understanding performance of LI-DiT
easily surpasses state-of-the-art open-source models as well as mainstream
closed-source commercial models including Stable Diffusion 3, DALL-E 3, and
Midjourney V6. The powerful LI-DiT-10B will be available after further
optimization and security checks.
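
The abstract gives no implementation details, but the core idea of using a decoder-only LLM as a prompt encoder can be illustrated with a minimal sketch. Everything below is an assumption made for illustration (the model name, the instruction prefix, and the choice of hidden layer are all hypothetical), not the paper's actual LI-DiT code:

```python
# Minimal sketch (not the paper's code): using a frozen decoder-only
# LLM as a prompt encoder for a text-to-image diffusion model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; swap in any stronger decoder-only LLM


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
llm.eval()  # the LLM stays frozen; only the diffusion model would be trained


@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    """Return per-token hidden states to condition the diffusion model on."""
    # A short instruction prefix is one plausible form of the "usage
    # guidance" the abstract mentions, nudging the LLM toward descriptive
    # features rather than pure next-token continuation. The exact
    # guidance text here is an assumption.
    guided = f"Describe the image this caption refers to: {prompt}"
    inputs = tokenizer(guided, return_tensors="pt")
    outputs = llm(**inputs)
    # Take the last hidden layer as token-level prompt features, used in
    # place of CLIP/T5 embeddings.
    return outputs.hidden_states[-1]  # shape: (1, seq_len, hidden_dim)


features = encode_prompt("a red fox sleeping under a maple tree")
print(features.shape)
```

How LI-DiT removes the positional bias of causal attention (late tokens see the whole prompt while early tokens see almost nothing) is not specified in the abstract, so that step is omitted here.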
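The abstract also states that multiple LLMs can be fused within the framework, without describing the mechanism. One straightforward possibility, purely an assumption for illustration, is to project each LLM's token features into a shared width and concatenate them along the sequence dimension:

```python
# Hypothetical fusion of prompt features from two LLMs; the abstract
# confirms such fusion exists but not how it is implemented.
import torch
import torch.nn as nn


class PromptFeatureFusion(nn.Module):
    """Project per-token features from two LLMs into a shared space and
    concatenate them along the sequence dimension."""

    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim_out)
        self.proj_b = nn.Linear(dim_b, dim_out)

    def forward(self, feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
        # feats_a: (B, len_a, dim_a), feats_b: (B, len_b, dim_b)
        return torch.cat([self.proj_a(feats_a), self.proj_b(feats_b)], dim=1)


fusion = PromptFeatureFusion(dim_a=4096, dim_b=3584, dim_out=2048)
fused = fusion(torch.randn(1, 32, 4096), torch.randn(1, 32, 3584))
print(fused.shape)  # torch.Size([1, 64, 2048])
```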