UMoE: Unifying Attention and FFN with Shared Experts
May 12, 2025
Authors: Yuanhang Yang, Chaozheng Wang, Jing Li
cs.AI
Abstract
Sparse Mixture of Experts (MoE) architectures have emerged as a promising
approach for scaling Transformer models. While initial works primarily
incorporated MoE into feed-forward network (FFN) layers, recent studies have
explored extending the MoE paradigm to attention layers to enhance model
performance. However, existing attention-based MoE layers require specialized
implementations and demonstrate suboptimal performance compared to their
FFN-based counterparts. In this paper, we aim to unify the MoE designs in
attention and FFN layers by introducing a novel reformulation of the attention
mechanism, revealing an underlying FFN-like structure within attention modules.
Our proposed architecture, UMoE, achieves superior performance through
attention-based MoE layers while enabling efficient parameter sharing between
FFN and attention components.
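
The abstract states that a reformulation of the attention mechanism reveals an FFN-like structure, but does not spell that reformulation out. The display below is one standard way to expose such an analogy for a single attention head, written under assumed notation (per-head projections W_Q, W_K, W_V, W_O, token representations x_j, head dimension d); it illustrates the general idea only and is not necessarily the reformulation the paper introduces.

% Illustrative analogy only; the paper's exact reformulation may differ.
% Single-head attention output for token i, with q_i = W_Q x_i and k_j = W_K x_j:
\[
  o_i \;=\; \sum_{j} \operatorname{softmax}_j\!\Bigl(\tfrac{q_i^{\top} k_j}{\sqrt{d}}\Bigr)\,\bigl(W_O W_V x_j\bigr)
\]
% A position-wise FFN applied to token i, with w_{2,m} the columns of W_2:
\[
  \operatorname{FFN}(x_i) \;=\; \sum_{m} \sigma\!\bigl(w_{1,m}^{\top} x_i\bigr)\, w_{2,m}
\]
% Both outputs are weighted sums of vectors: the attention scores act like
% hidden activations, and the token-dependent value vectors W_O W_V x_j take
% the place of the FFN's static second-layer columns w_{2,m}.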
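
For readers unfamiliar with the MoE paradigm the abstract builds on, below is a minimal sketch of a token-level top-k routed MoE layer whose experts live in one shared module, so the same expert pool could in principle be invoked from both attention and FFN sites. The class and parameter names (SharedMoE, Expert, d_model, n_experts, top_k) and the PyTorch implementation are illustrative assumptions, not code from the paper.

# Illustrative sketch only (not the authors' code): a token-level top-k routed
# MoE layer backed by one shared expert pool. All names and shapes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A single FFN expert: up-projection, nonlinearity, down-projection."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class SharedMoE(nn.Module):
    """Top-k routed mixture over a shared expert pool.

    The same instance can be invoked from an attention block and an FFN block,
    so both draw on the same experts (the parameter-sharing idea mentioned in
    the abstract); the routing itself is ordinary token-level top-k gating.
    """

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten tokens for routing.
        batch, seq_len, d_model = x.shape
        flat = x.reshape(-1, d_model)

        # Token-level routing: softmax over experts, keep the top-k per token,
        # and renormalize the kept weights so they sum to one.
        scores = F.softmax(self.router(flat), dim=-1)        # (tokens, n_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)   # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            # Tokens whose top-k selection includes expert e.
            token_idx, slot_idx = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(
                flat[token_idx]
            )

        return out.reshape(batch, seq_len, d_model)


if __name__ == "__main__":
    # Tiny smoke test with assumed dimensions.
    moe = SharedMoE(d_model=64, d_hidden=256, n_experts=8, top_k=2)
    tokens = torch.randn(2, 16, 64)
    print(moe(tokens).shape)  # torch.Size([2, 16, 64])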