Scaling Diffusion Transformers to 16 Billion Parameters
July 16, 2024
Authors: Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
cs.AI
Abstract
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference. DiT-MoE includes two simple designs: shared expert routing and an expert-level balance loss, thereby capturing common knowledge and reducing redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of expert specialization yields some interesting observations: (i) expert selection shows a preference for spatial position and denoising time step, while being insensitive to different class-conditional information; (ii) as the MoE layers go deeper, the selection of experts gradually shifts from specific spatial positions to dispersion and balance; (iii) expert specialization tends to be more concentrated at early time steps and then gradually becomes uniform after the halfway point. We attribute this to the diffusion process, which first models low-frequency spatial information and then high-frequency complex information. Based on the above guidance, a series of DiT-MoE models experimentally achieves performance on par with dense networks yet requires much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 16.5B parameters and attaining a new SoTA FID-50K score of 1.80 at 512×512 resolution. Project page: https://github.com/feizc/DiT-MoE.
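Below is a minimal, illustrative sketch (in PyTorch) of the two designs named in the abstract: a shared expert applied to every token alongside top-k routed experts, plus an auxiliary expert-level balance loss. The module and parameter names, the top-k routing scheme, and the exact form of the balance loss are assumptions for illustration only; the authors' actual implementation is in the project repository linked above.

```python
# Minimal sketch of a DiT-MoE-style feed-forward block: one always-on shared
# expert plus top-k routed experts and a Switch-Transformer-style expert-level
# balance loss. All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Shared expert: applied to every token to capture common knowledge.
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim)
        )
        # Routed experts: each token is dispatched only to its top-k experts.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, dim), batch and sequence dimensions already flattened.
        logits = self.router(x)                       # (tokens, experts)
        probs = logits.softmax(dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)

        out = self.shared_expert(x)
        for e in range(self.num_experts):
            hit = (topk_idx == e)                     # (tokens, top_k) bool
            token_mask = hit.any(dim=-1)
            if token_mask.any():
                gate = (topk_probs * hit).sum(dim=-1, keepdim=True)  # (tokens, 1)
                out[token_mask] = out[token_mask] + gate[token_mask] * self.experts[e](x[token_mask])

        # Expert-level balance loss: pushes the fraction of tokens routed to each
        # expert toward its mean gate probability so no expert is starved.
        ones = torch.ones(topk_idx.numel(), device=x.device, dtype=x.dtype)
        frac_tokens = torch.zeros(self.num_experts, device=x.device, dtype=x.dtype)
        frac_tokens.scatter_add_(0, topk_idx.flatten(), ones)
        frac_tokens = frac_tokens / topk_idx.numel()
        mean_probs = probs.mean(dim=0)
        balance_loss = self.num_experts * (frac_tokens * mean_probs).sum()
        return out, balance_loss


# Usage sketch: add the auxiliary loss to the main diffusion objective, e.g.
# total_loss = diffusion_loss + 0.01 * balance_loss (the weight is illustrative).
```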