Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
May 15, 2025
Authors: Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, Saining Xie
cs.AI
Abstract
This paper does not describe a new method; instead, it provides a thorough
exploration of an important yet understudied design space related to recent
advances in text-to-image synthesis -- specifically, the deep fusion of large
language models (LLMs) and diffusion transformers (DiTs) for multi-modal
generation. Previous studies mainly focused on overall system performance
rather than detailed comparisons with alternative methods, and key design
details and training recipes were often left undisclosed. These gaps create
uncertainty about the real potential of this approach. To fill these gaps, we
conduct an empirical study on text-to-image generation, performing controlled
comparisons with established baselines, analyzing important design choices, and
providing a clear, reproducible recipe for training at scale. We hope this work
offers meaningful data points and practical guidelines for future research in
multi-modal generation.
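
The abstract does not spell out the architecture, but "deep fusion" in this line of work generally refers to running text tokens (LLM hidden states) and noised image tokens through a shared transformer backbone with joint self-attention, rather than conditioning via cross-attention alone. The following is a minimal, illustrative PyTorch sketch of one such fused block; `DeepFusionBlock`, the modality-specific norms, and all dimensions are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DeepFusionBlock(nn.Module):
    """One transformer block with joint self-attention over concatenated
    text (LLM hidden-state) and image (noised-latent) tokens.
    Names and dimensions here are illustrative, not from the paper."""

    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.norm_txt = nn.LayerNorm(dim)   # modality-specific pre-norms (assumed)
        self.norm_img = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, txt: torch.Tensor, img: torch.Tensor):
        # Concatenate the two token streams so each modality attends to
        # the other inside a single self-attention call (the "fusion").
        x = torch.cat([self.norm_txt(txt), self.norm_img(img)], dim=1)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        h = torch.cat([txt, img], dim=1) + attn_out      # residual
        h = h + self.mlp(self.norm_mlp(h))               # feed-forward + residual
        # Split back into text and image streams for the next block.
        n_txt = txt.shape[1]
        return h[:, :n_txt], h[:, n_txt:]

# Toy usage: batch of 2, 77 text tokens, 256 image-latent tokens.
txt = torch.randn(2, 77, 1024)
img = torch.randn(2, 256, 1024)
block = DeepFusionBlock()
txt_out, img_out = block(txt, img)
print(txt_out.shape, img_out.shape)  # (2, 77, 1024) (2, 256, 1024)
```

In practice, the design choices the paper says it studies (which parts are frozen, how the LLM and DiT layers are interleaved, the training recipe) all sit on top of a joint-attention skeleton of roughly this shape.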