Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
October 28, 2025
Authors: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, Hongming Shan
cs.AI
Abstract
Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model
capacity while preserving computational efficiency. Despite its notable success
in large language models (LLMs), existing attempts to apply MoE to Diffusion
Transformers (DiTs) have yielded limited gains. We attribute this gap to
fundamental differences between language and visual tokens. Language tokens are
semantically dense with pronounced inter-token variation, while visual tokens
exhibit spatial redundancy and functional heterogeneity, hindering expert
specialization in vision MoE. To address this, we present ProMoE, an MoE framework
featuring a two-step router with explicit routing guidance that promotes expert
specialization. Specifically, this guidance encourages the router to partition
image tokens into conditional and unconditional sets via conditional routing
according to their functional roles, and to refine the assignments of conditional
image tokens through prototypical routing with learnable prototypes based on
semantic content. Moreover, the similarity-based expert allocation in latent
space enabled by prototypical routing offers a natural mechanism for
incorporating explicit semantic guidance, and we validate that such guidance is
crucial for vision MoE. Building on this, we propose a routing contrastive loss
that explicitly enhances the prototypical routing process, promoting
intra-expert coherence and inter-expert diversity. Extensive experiments on
the ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods
under both Rectified Flow and DDPM training objectives. Code and models will be
made publicly available.
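
The abstract names two mechanisms: prototypical routing, which assigns tokens to experts by similarity against learnable prototypes in latent space, and a routing contrastive loss that encourages intra-expert coherence and inter-expert diversity. The snippet below is a minimal PyTorch sketch of how such a router and auxiliary loss could be wired up; the class names, shapes, top-k gating, and InfoNCE-style loss form are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
# Sketch only: prototypical routing by token-prototype similarity plus a
# contrastive auxiliary loss. All names and the exact loss form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypicalRouter(nn.Module):
    """Route tokens to experts by cosine similarity to learnable prototypes."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 1):
        super().__init__()
        # One learnable prototype vector per expert.
        self.prototypes = nn.Parameter(torch.randn(num_experts, dim))
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, dim)
        t = F.normalize(tokens, dim=-1)
        p = F.normalize(self.prototypes, dim=-1)
        sim = t @ p.t()                      # (batch, num_tokens, num_experts)
        gate = sim.softmax(dim=-1)           # soft routing scores
        weights, expert_idx = gate.topk(self.top_k, dim=-1)
        return sim, weights, expert_idx


def routing_contrastive_loss(sim: torch.Tensor,
                             expert_idx: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: a token's assigned prototype is the positive,
    all other prototypes are negatives, pulling tokens toward their expert
    (intra-expert coherence) and apart from the rest (inter-expert diversity)."""
    logits = sim / temperature               # (batch, num_tokens, num_experts)
    target = expert_idx[..., 0]              # top-1 assignment as the positive
    return F.cross_entropy(logits.flatten(0, 1), target.flatten())


# Usage sketch with arbitrary sizes.
router = PrototypicalRouter(dim=512, num_experts=8, top_k=2)
x = torch.randn(4, 256, 512)                 # (batch, tokens, dim) image tokens
sim, weights, idx = router(x)
aux_loss = routing_contrastive_loss(sim, idx)
```

In this reading, the contrastive term acts as the "explicit routing guidance" the abstract refers to: it supervises the similarity structure used for routing rather than only balancing expert load.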