DiT-Air: Revisiting the Efficiency of Diffusion Model Architecture Design in Text to Image Generation
March 13, 2025
作者: Chen Chen, Rui Qian, Wenze Hu, Tsu-Jui Fu, Lezhi Li, Bowen Zhang, Alex Schwing, Wei Liu, Yinfei Yang
cs.AI
Abstract
In this work, we empirically study Diffusion Transformers (DiTs) for
text-to-image generation, focusing on architectural choices, text-conditioning
strategies, and training protocols. We evaluate a range of DiT-based
architectures--including PixArt-style and MMDiT variants--and compare them with
a standard DiT variant which directly processes concatenated text and noise
inputs. Surprisingly, our findings reveal that the performance of standard DiT
is comparable with those specialized models, while demonstrating superior
parameter-efficiency, especially when scaled up. Leveraging the layer-wise
parameter sharing strategy, we achieve a further reduction of 66% in model size
compared to an MMDiT architecture, with minimal performance impact. Building on
an in-depth analysis of critical components such as text encoders and
Variational Auto-Encoders (VAEs), we introduce DiT-Air and DiT-Air-Lite. With
supervised and reward fine-tuning, DiT-Air achieves state-of-the-art
performance on GenEval and T2I CompBench, while DiT-Air-Lite remains highly
competitive, surpassing most existing models despite its compact size.
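The layer-wise parameter sharing mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's actual scheme or configuration; the class name, dimensions, and depth below are illustrative assumptions. The idea it demonstrates is that reusing one transformer block's weights across all depths keeps the parameter count constant as depth grows, whereas a standard stack scales linearly with the number of layers.

```python
import torch.nn as nn

class SharedBlockTransformer(nn.Module):
    """Illustrative sketch of layer-wise parameter sharing:
    a single transformer block is applied repeatedly, so the
    parameter count does not grow with depth."""

    def __init__(self, dim=256, heads=4, depth=12):
        super().__init__()
        self.depth = depth
        # One block, reused at every depth.
        self.block = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x):
        for _ in range(self.depth):
            x = self.block(x)  # same weights applied at each layer
        return x

# Compare against a conventional (unshared) 12-layer stack.
shared = SharedBlockTransformer(dim=256, heads=4, depth=12)
unshared = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=12)

n_shared = sum(p.numel() for p in shared.parameters())
n_unshared = sum(p.numel() for p in unshared.parameters())
# The shared model holds 1/12 of the unshared model's parameters.
```

The 66% reduction reported for DiT-Air-Lite relative to MMDiT presumably reflects a specific sharing pattern and architecture rather than this fully-shared extreme, but the mechanism for trading depth against parameter count is the same.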