Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
May 15, 2025
Authors: Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, Saining Xie
cs.AI
Abstract
This paper does not describe a new method; instead, it provides a thorough
exploration of an important yet understudied design space related to recent
advances in text-to-image synthesis -- specifically, the deep fusion of large
language models (LLMs) and diffusion transformers (DiTs) for multi-modal
generation. Previous studies mainly focused on overall system performance
rather than detailed comparisons with alternative methods, and key design
details and training recipes were often left undisclosed. These gaps create
uncertainty about the real potential of this approach. To fill these gaps, we
conduct an empirical study on text-to-image generation, performing controlled
comparisons with established baselines, analyzing important design choices, and
providing a clear, reproducible recipe for training at scale. We hope this work
offers meaningful data points and practical guidelines for future research in
multi-modal generation.
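For readers new to this design space, the sketch below illustrates one common reading of "deep fusion": image tokens in a trainable DiT stream and text hidden states taken from a frozen LLM layer are concatenated and mixed by joint self-attention inside each transformer block. This is a minimal, illustrative assumption rather than the paper's exact architecture; the `DeepFusionBlock` name and all dimensions are hypothetical.

```python
# Minimal sketch of a deep-fusion block (illustrative, not the paper's recipe).
import torch
import torch.nn as nn

class DeepFusionBlock(nn.Module):
    def __init__(self, dim: int = 1024, n_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # Joint attention over the concatenated sequence: image tokens can
        # attend to text tokens (from a frozen LLM layer) and vice versa.
        x = torch.cat([txt, img], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        # Return only the updated image tokens; in a full deep-fusion stack,
        # the text stream would come from the frozen LLM's next layer.
        return x[:, txt.shape[1]:]

# Toy usage: batch of 2, 64 text tokens, 256 latent image patches.
block = DeepFusionBlock()
txt = torch.randn(2, 64, 1024)   # hidden states from a frozen LLM layer
img = torch.randn(2, 256, 1024)  # noised latent patches in the DiT stream
out = block(img, txt)
print(out.shape)  # torch.Size([2, 256, 1024])
```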