Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Abstract

Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding compared to CLIP and T5-series models. However, how best to use current advanced LLMs in text-to-image diffusion models remains to be explored. We observe an unusual phenomenon: directly using an LLM as the prompt encoder significantly degrades prompt-following ability in image generation. We identify two main obstacles behind this issue. The first is the misalignment between the next-token-prediction training of LLMs and the discriminative prompt features that diffusion models require. The second is the intrinsic positional bias introduced by the decoder-only architecture. To address these obstacles, we propose a novel framework that fully harnesses the capabilities of LLMs. Through carefully designed usage guidance, we enhance the text representation capability for prompt encoding and eliminate the inherent positional bias. This allows us to integrate state-of-the-art LLMs into text-to-image generation models flexibly. We also provide an effective way to fuse multiple LLMs within our framework. Considering the excellent performance and scaling behavior of the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on this framework. We conduct extensive experiments to validate LI-DiT across model sizes and data sizes. Benefiting from the inherent ability of LLMs and our design, the prompt understanding performance of LI-DiT surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models, including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The LI-DiT-10B model will be made available after further optimization and security checks.
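
As a rough illustration of the positional bias mentioned in the abstract, the toy sketch below builds the standard causal attention mask of a decoder-only transformer and reports how much of the prompt each position can attend to. This is not the paper's method or code. The function names and the example prompt are hypothetical and chosen only for illustration.

```python
def causal_mask(n):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def visible_fraction(n):
    """Fraction of an n-token prompt that each position can attend to."""
    return [sum(row) / n for row in causal_mask(n)]


# Hypothetical prompt, split into words rather than real tokenizer output.
tokens = ["a", "red", "cube", "on", "a", "blue", "sphere"]

for tok, frac in zip(tokens, visible_fraction(len(tokens))):
    print(f"{tok:>7}: attends to {frac:.0%} of the prompt")
```

Under this mask, the representation of an early token never incorporates later words, while the last token can attend to the whole prompt. Per-token features taken directly from such a model are therefore uneven in how much context they carry, which is one plausible reading of the bias the abstract describes.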
