

Scaling Diffusion Transformers Efficiently via μP

May 21, 2025
Authors: Chenyu Zheng, Xinyu Zhang, Rongzhen Wang, Wei Huang, Zhi Tian, Weilin Huang, Jun Zhu, Chongxuan Li
cs.AI

Abstract

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization (muP) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models, and dramatically reduces tuning costs. However, it remains unclear whether muP of vanilla Transformers extends to diffusion Transformers, which differ in both architecture and training objective. In this work, we generalize standard muP to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that muP of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-alpha, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing muP methodologies. Leveraging this result, we systematically demonstrate that DiT-muP enjoys robust HP transferability. Notably, DiT-XL-2-muP with the transferred learning rate achieves 2.9 times faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of muP on text-to-image generation by scaling PixArt-alpha from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under muP outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-alpha and 3% of the compute consumed by human experts for MMDiT-18B. These results establish muP as a principled and efficient framework for scaling diffusion Transformers.
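
To make the parametrization concrete, below is a minimal sketch of the standard muP scaling rules for Adam on a toy MLP, assuming PyTorch. The widths, base width, and base learning rate are hypothetical placeholders for illustration, not values from the paper; the paper's point is that the analogous rules carry over unchanged to DiT-style diffusion Transformers.

```python
# A minimal sketch of standard muP scaling rules (Adam) on a toy MLP, assuming
# PyTorch. Widths, base width, and base learning rate are illustrative
# placeholders, not values from the paper.
import torch
import torch.nn as nn


def build_mup_model_and_optimizer(width, base_width=256, d_in=32, d_out=10,
                                  base_lr=1e-3):
    m = width / base_width          # width multiplier vs. the small proxy model
    inp = nn.Linear(d_in, width)    # "input-like" weights: fan_in stays fixed
    hid = nn.Linear(width, width)   # "hidden" weights: fan_in grows with width
    out = nn.Linear(width, d_out)   # "output-like" weights: readout layer

    # muP init: input/hidden std ~ 1/sqrt(fan_in); output std ~ 1/fan_in.
    nn.init.normal_(inp.weight, std=d_in ** -0.5)
    nn.init.normal_(hid.weight, std=width ** -0.5)
    nn.init.normal_(out.weight, std=1.0 / width)
    for layer in (inp, hid, out):
        nn.init.zeros_(layer.bias)

    model = nn.Sequential(inp, nn.GELU(), hid, nn.GELU(), out)

    # muP Adam learning rates: input weights and all biases keep the base LR;
    # hidden and output weights scale it down by the width multiplier.
    param_groups = [
        {"params": inp.parameters(), "lr": base_lr},
        {"params": hid.weight, "lr": base_lr / m},
        {"params": hid.bias, "lr": base_lr},
        {"params": out.weight, "lr": base_lr / m},
        {"params": out.bias, "lr": base_lr},
    ]
    optimizer = torch.optim.Adam(param_groups, lr=base_lr)
    return model, optimizer


# Hyperparameters tuned on a narrow proxy (width == base_width) are reused
# unchanged when instantiating a much wider model:
model, opt = build_mup_model_and_optimizer(width=4096)
```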