
Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

June 17, 2024
Authors: Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu
cs.AI

Abstract

Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observe an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identify two main obstacles behind this issue. One is the misalignment between next-token-prediction training in LLMs and the discriminative prompt features required by diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To address this issue, we propose a novel framework that fully harnesses the capabilities of LLMs. Through carefully designed usage guidance, we effectively enhance the text-representation capability for prompt encoding and eliminate the LLM's inherent positional bias. This allows us to flexibly integrate state-of-the-art LLMs into text-to-image generation models. Furthermore, we provide an effective way to fuse multiple LLMs within our framework. Considering the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on this framework. We conduct extensive experiments to validate LI-DiT across model sizes and data scales. Benefiting from the inherent abilities of LLMs and our innovative designs, the prompt-understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models, including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be made available after further optimization and security checks.
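The abstract points to two concrete design ideas: refining the token features of a causal, decoder-only LLM with a non-causal module to counteract its positional bias, and fusing features from several LLMs into a single conditioning sequence for the diffusion transformer. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not the paper's released implementation; the class names (`BidirectionalPromptRefiner`, `FusedPromptEncoder`), the layer counts, and the concatenation-based fusion are assumptions made here for illustration only.

```python
# Hypothetical sketch (not the paper's code): refine per-token hidden states
# from frozen decoder-only LLMs with a small *bidirectional* transformer, so
# every prompt token can attend to tokens on both sides, then use the result
# as conditioning for a diffusion transformer's cross-attention.

import torch
import torch.nn as nn


class BidirectionalPromptRefiner(nn.Module):
    """Refines causal-LLM token features with full (non-causal) self-attention."""

    def __init__(self, dim: int, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, llm_features: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # llm_features: (batch, seq_len, dim) hidden states from a frozen LLM.
        # pad_mask: (batch, seq_len), True where the token is padding.
        # No causal mask here: each token sees the whole prompt bidirectionally.
        return self.encoder(llm_features, src_key_padding_mask=pad_mask)


class FusedPromptEncoder(nn.Module):
    """Projects features from multiple LLMs to a shared width and concatenates
    them along the token axis -- one plausible way to 'fuse multiple LLMs'."""

    def __init__(self, llm_dims: list[int], cond_dim: int):
        super().__init__()
        self.projs = nn.ModuleList(nn.Linear(d, cond_dim) for d in llm_dims)
        self.refiner = BidirectionalPromptRefiner(cond_dim)

    def forward(self, feats: list[torch.Tensor], masks: list[torch.Tensor]) -> torch.Tensor:
        projected = [proj(f) for proj, f in zip(self.projs, feats)]
        tokens = torch.cat(projected, dim=1)   # (batch, total_seq_len, cond_dim)
        pad_mask = torch.cat(masks, dim=1)
        return self.refiner(tokens, pad_mask)


if __name__ == "__main__":
    # Toy shapes only; real features would come from frozen LLMs' last layers.
    b, n1, n2 = 2, 16, 12
    encoder = FusedPromptEncoder(llm_dims=[4096, 2048], cond_dim=1024)
    feats = [torch.randn(b, n1, 4096), torch.randn(b, n2, 2048)]
    masks = [torch.zeros(b, n1, dtype=torch.bool), torch.zeros(b, n2, dtype=torch.bool)]
    cond = encoder(feats, masks)               # -> shape (2, 28, 1024)
    print(cond.shape)
```

In practice the `llm_features` inputs would be the last-layer hidden states of frozen decoder-only models; concatenating projected token sequences is only one plausible fusion choice, and the paper's actual refiner and fusion design may differ.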
