Scaling Diffusion Transformers to 16 Billion Parameters
July 16, 2024
Authors: Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, Junshi Huang
cs.AI
Abstract
In this paper, we present DiT-MoE, a sparse version of the diffusion Transformer that is scalable and competitive with dense networks while exhibiting highly optimized inference. DiT-MoE includes two simple designs: shared expert routing and an expert-level balance loss, thereby capturing common knowledge and reducing redundancy among the different routed experts. When applied to conditional image generation, a deep analysis of expert specialization yields some interesting observations: (i) expert selection shows a preference for spatial position and denoising time step, while being insensitive to different class-conditional information; (ii) as the MoE layers go deeper, the selection of experts gradually shifts from specific spatial positions to dispersion and balance; (iii) expert specialization tends to be more concentrated at early time steps and then gradually becomes uniform in the second half. We attribute this to the diffusion process, which first models low-frequency spatial information and then high-frequency complex information. Based on the above guidance, a series of DiT-MoE models experimentally achieves performance on par with dense networks yet requires much less computational load during inference. More encouragingly, we demonstrate the potential of DiT-MoE with synthesized image data, scaling the diffusion model to 16.5B parameters and attaining a new SoTA FID-50K score of 1.80 at 512×512 resolution. Project page: https://github.com/feizc/DiT-MoE.
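
The two designs named in the abstract, a shared expert that processes every token alongside top-k routed experts, and an expert-level load-balancing loss, can be illustrated with a minimal PyTorch-style sketch. This is an assumption-laden illustration rather than the authors' released code: the hidden sizes, the expert count, the top-k value, and the Switch-Transformer-style form of the balance loss are all hypothetical choices made for clarity.

```python
# Minimal sketch of a DiT-MoE-style feed-forward layer, assuming PyTorch.
# Sizes, expert count, top-k routing, and the exact balance-loss form are
# illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A standard Transformer MLP used as a single expert."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class SparseMoEFFN(nn.Module):
    """Shared expert + top-k routed experts + expert-level balance loss."""
    def __init__(self, dim: int = 1152, hidden: int = 4608, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.num_experts, self.top_k = num_experts, top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList(ExpertFFN(dim, hidden) for _ in range(num_experts))
        self.shared_expert = ExpertFFN(dim, hidden)  # always active: captures common knowledge

    def forward(self, x: torch.Tensor):
        b, t, d = x.shape
        tokens = x.reshape(-1, d)                                    # (N, d) with N = b * t

        probs = F.softmax(self.router(tokens), dim=-1)               # (N, E) routing probabilities
        topk_p, topk_idx = probs.topk(self.top_k, dim=-1)            # (N, k)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)           # renormalize over chosen experts

        out = self.shared_expert(tokens)                             # shared expert sees every token
        for e, expert in enumerate(self.experts):
            hit = (topk_idx == e)                                    # (N, k): tokens routed to expert e
            token_ids, slot = hit.nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            gate = topk_p[token_ids, slot].unsqueeze(-1)             # gating weights for expert e
            out = out.index_add(0, token_ids, gate * expert(tokens[token_ids]))

        # Expert-level balance loss (Switch-Transformer-style auxiliary term):
        # penalize the product of per-expert token fraction and mean routing
        # probability so that all routed experts receive a similar load.
        assign = F.one_hot(topk_idx, self.num_experts).float().sum(dim=1)  # (N, E) hard assignments
        balance_loss = self.num_experts * (assign.mean(0) * probs.mean(0)).sum()

        return out.reshape(b, t, d), balance_loss
```

In training, `balance_loss`, scaled by a small coefficient, would be added to the diffusion objective; at inference only the shared expert and the top-k routed experts per token are evaluated, which is why the active compute stays far below the total parameter count.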