UMoE: Unifying Attention and FFN with Shared Experts
May 12, 2025
Authors: Yuanhang Yang, Chaozheng Wang, Jing Li
cs.AI
Abstract
Sparse Mixture of Experts (MoE) architectures have emerged as a promising
approach for scaling Transformer models. While initial works primarily
incorporated MoE into feed-forward network (FFN) layers, recent studies have
explored extending the MoE paradigm to attention layers to enhance model
performance. However, existing attention-based MoE layers require specialized
implementations and demonstrate suboptimal performance compared to their
FFN-based counterparts. In this paper, we aim to unify the MoE designs in
attention and FFN layers by introducing a novel reformulation of the attention
mechanism, revealing an underlying FFN-like structure within attention modules.
Our proposed architecture, UMoE, achieves superior performance through
attention-based MoE layers while enabling efficient parameter sharing between
FFN and attention components.
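For orientation only, the sketch below shows a generic top-k routed mixture-of-experts layer over FFN experts in PyTorch. It is not the authors' implementation; every name (TopKSharedExpertMoE, d_model, n_experts, top_k) is illustrative, and the comments only mark where UMoE-style sharing of expert parameters between attention-based and FFN-based MoE layers would plug in.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSharedExpertMoE(nn.Module):
    """Generic top-k routed mixture of FFN experts; a sketch, not the UMoE code."""
    def __init__(self, d_model=256, d_hidden=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # These expert weights are the kind of parameters that, per the abstract,
        # could be shared between attention-based and FFN-based MoE layers.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):
        # x: (batch, seq, d_model). In an attention-based MoE layer the input would
        # be the attention-weighted token mixture rather than the raw hidden state.
        scores = self.router(x)                               # (B, S, E) routing logits
        topk_scores, topk_idx = scores.topk(self.top_k, -1)   # top-k experts per token
        gates = F.softmax(topk_scores, dim=-1)                # renormalize chosen gates
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]                         # (B, S) expert id per token
            gate = gates[..., slot].unsqueeze(-1)             # (B, S, 1) gate value
            for e, expert in enumerate(self.experts):
                mask = idx == e                               # tokens routed to expert e
                if mask.any():
                    out[mask] = out[mask] + gate[mask] * expert(x[mask])
        return out

# Example: route 2 sequences of 16 tokens through the layer.
layer = TopKSharedExpertMoE()
y = layer(torch.randn(2, 16, 256))   # y has shape (2, 16, 256)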