Scaling Diffusion Transformers Efficiently via μP

May 21, 2025
Authors: Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
cs.AI

Abstract

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization (μP) was proposed for vanilla Transformers; it enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether μP for vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard μP to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that the μP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-alpha, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing μP methodologies. Leveraging this result, we systematically demonstrate that DiT-μP enjoys robust HP transferability. Notably, DiT-XL-2-μP with a transferred learning rate achieves 2.9x faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of μP on text-to-image generation by scaling PixArt-alpha from 0.04B to 0.61B parameters and MMDiT from 0.18B to 18B. In both cases, models under μP outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-alpha and 3% of the consumption by human experts for MMDiT-18B. These results establish μP as a principled and efficient framework for scaling diffusion Transformers.
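
To make the HP-transfer recipe concrete, the sketch below is a minimal illustration, not the authors' implementation; names such as ToyBlock and mup_param_groups are hypothetical. It shows the core μP rule in PyTorch: matrix-like hidden weights are initialized with variance proportional to 1/fan_in and optimized with an Adam learning rate scaled by base_width/width, while vector-like parameters (norms, biases) keep the base learning rate, so a learning rate swept on a narrow proxy model can be reused at larger widths. The full μP recipe also includes details such as 1/d attention-logit scaling and output multipliers; see the μP literature and the paper for those.

```python
# Minimal, illustrative sketch (not the authors' code) of muP-style
# learning-rate transfer, assuming a toy DiT-like block. Hidden
# (matrix-like) weights get 1/fan_in init variance and an Adam LR scaled
# by base_width / width; vector-like parameters keep the base LR.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


def mup_param_groups(model: nn.Module, base_lr: float, width: int, base_width: int):
    """Split parameters into muP groups; hidden weights use lr * base_width / width."""
    hidden, vector_like = [], []
    for p in model.parameters():
        # Heuristic: 2-D weights whose dims both grow with width are "matrix-like".
        if p.ndim == 2 and min(p.shape) >= base_width:
            hidden.append(p)
        else:
            vector_like.append(p)
    return [
        {"params": hidden, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]


class ToyBlock(nn.Module):
    """A toy residual MLP block standing in for one diffusion-Transformer block."""

    def __init__(self, width: int):
        super().__init__()
        self.norm = nn.LayerNorm(width)
        self.fc1 = nn.Linear(width, 4 * width)
        self.fc2 = nn.Linear(4 * width, width)
        # muP init: standard deviation proportional to 1 / sqrt(fan_in).
        nn.init.normal_(self.fc1.weight, std=1.0 / math.sqrt(width))
        nn.init.normal_(self.fc2.weight, std=1.0 / math.sqrt(4 * width))

    def forward(self, x):
        return x + self.fc2(F.gelu(self.fc1(self.norm(x))))


if __name__ == "__main__":
    base_width, width = 256, 1024      # tune HPs at width 256, transfer to 1024
    base_lr = 1e-3                     # hypothetical value found on the proxy model
    model = ToyBlock(width)
    opt = torch.optim.AdamW(mup_param_groups(model, base_lr, width, base_width))

    x = torch.randn(8, width)
    loss = model(x).pow(2).mean()      # dummy objective, just to drive one step
    loss.backward()
    opt.step()                         # hidden weights stepped with lr * 256 / 1024
```

In this toy setup, the same base_lr swept at width 256 is reused at width 1024, with the hidden-weight rate automatically shrunk by 4x; this is the kind of transferred learning rate referred to for DiT-XL-2-μP above.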