DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
March 13, 2025
Authors: Chen Chen, Rui Qian, Wenze Hu, Tsu-Jui Fu, Lezhi Li, Bowen Zhang, Alex Schwing, Wei Liu, Yinfei Yang
cs.AI
Abstract
In this work, we empirically study Diffusion Transformers (DiTs) for
text-to-image generation, focusing on architectural choices, text-conditioning
strategies, and training protocols. We evaluate a range of DiT-based
architectures--including PixArt-style and MMDiT variants--and compare them with
a standard DiT variant which directly processes concatenated text and noise
inputs. Surprisingly, our findings reveal that standard DiT performs comparably
to these specialized models while demonstrating superior parameter efficiency,
especially when scaled up. Leveraging a layer-wise
parameter sharing strategy, we achieve a further reduction of 66% in model size
compared to an MMDiT architecture, with minimal performance impact. Building on
an in-depth analysis of critical components such as text encoders and
Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With
supervised and reward fine-tuning, DiT-Air achieves state-of-the-art
performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly
competitive, surpassing most existing models despite its compact size.
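As a concrete illustration of the "standard DiT" conditioning path described in the abstract, the sketch below concatenates text-encoder tokens with noised latent patch tokens into a single sequence processed by ordinary self-attention blocks, in contrast to the separate image/text streams of an MMDiT. This is a minimal PyTorch sketch under our own assumptions (module names, dimensions, and the omission of timestep/AdaLN conditioning are all illustrative), not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of a single-stream "standard DiT":
# text-token embeddings and noised VAE-latent patch tokens are concatenated
# into one sequence and processed jointly by ordinary transformer blocks.
# Timestep conditioning (e.g., AdaLN) is omitted for brevity.
import torch
import torch.nn as nn


class DiTBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        # Joint self-attention over the concatenated text + image tokens.
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        return x


class StandardDiT(nn.Module):
    """Single-stream DiT over a concatenated text/noise token sequence."""

    def __init__(self, dim: int = 768, depth: int = 12,
                 text_dim: int = 768, patch_dim: int = 64):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)    # project text-encoder outputs
        self.patch_proj = nn.Linear(patch_dim, dim)  # project noised latent patches
        self.blocks = nn.ModuleList([DiTBlock(dim) for _ in range(depth)])
        self.out = nn.Linear(dim, patch_dim)         # prediction for image tokens only

    def forward(self, text_tokens: torch.Tensor,
                noisy_patches: torch.Tensor) -> torch.Tensor:
        t = self.text_proj(text_tokens)              # (B, L_text, dim)
        z = self.patch_proj(noisy_patches)           # (B, L_img, dim)
        x = torch.cat([t, z], dim=1)                 # one concatenated sequence
        for blk in self.blocks:
            x = blk(x)
        return self.out(x[:, t.shape[1]:])           # keep only image-token outputs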
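The layer-wise parameter sharing mentioned in the abstract can be pictured as reusing one block's weights across depth, so the parameter count stays nearly constant as depth grows. The sketch below, which reuses the `DiTBlock` defined in the previous snippet, shows only one possible sharing pattern and is an assumption on our part; the paper's exact DiT-Air-Lite scheme may differ, though it reports a roughly 66% size reduction relative to an MMDiT architecture.

```python
# Illustrative sketch of layer-wise parameter sharing (assumed pattern, not the
# paper's exact scheme): a single block's weights are applied at every depth.
import torch.nn as nn


class SharedDepthDiT(nn.Module):
    def __init__(self, dim: int = 768, depth: int = 12):
        super().__init__()
        # One set of weights; DiTBlock comes from the sketch above.
        self.shared_block = DiTBlock(dim)
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):   # apply the same weights at every layer
            x = self.shared_block(x)
        return x
```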