Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
May 15, 2025
Authors: Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, Saining Xie
cs.AI
Abstract
This paper does not describe a new method; instead, it provides a thorough
exploration of an important yet understudied design space related to recent
advances in text-to-image synthesis -- specifically, the deep fusion of large
language models (LLMs) and diffusion transformers (DiTs) for multi-modal
generation. Previous studies mainly focused on overall system performance
rather than detailed comparisons with alternative methods, and key design
details and training recipes were often left undisclosed. These gaps create
uncertainty about the real potential of this approach. To fill these gaps, we
conduct an empirical study on text-to-image generation, performing controlled
comparisons with established baselines, analyzing important design choices, and
providing a clear, reproducible recipe for training at scale. We hope this work
offers meaningful data points and practical guidelines for future research in
multi-modal generation.
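For readers new to this design space, the sketch below illustrates one common reading of "deep fusion": image tokens in a trainable DiT stream and text hidden states taken from a frozen LLM layer are concatenated and mixed by joint self-attention inside each transformer block. This is a minimal, illustrative assumption rather than the paper's exact architecture; the `DeepFusionBlock` name and all dimensions are hypothetical.

```python
# Minimal sketch of a deep-fusion block (illustrative, not the paper's recipe).
import torch
import torch.nn as nn

class DeepFusionBlock(nn.Module):
    def __init__(self, dim: int = 1024, n_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # Joint attention over the concatenated sequence: image tokens can
        # attend to text tokens (from a frozen LLM layer) and vice versa.
        x = torch.cat([txt, img], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Return only the updated image tokens; in a full deep-fusion stack,
        # the text stream would come from the frozen LLM's next layer.
        return x[:, txt.shape[1]:]

# Toy usage: batch of 2, 64 text tokens, 256 latent image patches.
block = DeepFusionBlock()
txt = torch.randn(2, 64, 1024)   # hidden states from a frozen LLM layer
img = torch.randn(2, 256, 1024)  # noised latent patches in the DiT stream
out = block(img, txt)
print(out.shape)  # torch.Size([2, 256, 1024])
```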