UMoE: Unifying Attention and FFN with Shared Experts
May 12, 2025
Authors: Yuanhang Yang, Chaozheng Wang, Jing Li
cs.AI
Abstract
Sparse Mixture of Experts (MoE) architectures have emerged as a promising
approach for scaling Transformer models. While initial works primarily
incorporated MoE into feed-forward network (FFN) layers, recent studies have
explored extending the MoE paradigm to attention layers to enhance model
performance. However, existing attention-based MoE layers require specialized
implementations and demonstrate suboptimal performance compared to their
FFN-based counterparts. In this paper, we aim to unify the MoE designs in
attention and FFN layers by introducing a novel reformulation of the attention
mechanism, revealing an underlying FFN-like structure within attention modules.
Our proposed architecture, UMoE, achieves superior performance through
attention-based MoE layers while enabling efficient parameter sharing between
FFN and attention components.
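
The abstract states that a reformulation of the attention mechanism reveals an FFN-like structure, but does not spell that reformulation out. The display below is one standard way to expose such an analogy for a single attention head, written under assumed notation (per-head projections W_Q, W_K, W_V, W_O, token representations x_j, head dimension d); it illustrates the general idea only and is not necessarily the reformulation the paper introduces.

% Illustrative analogy only; the paper's exact reformulation may differ.
% Single-head attention output for token i, with q_i = W_Q x_i and k_j = W_K x_j:
\[
  o_i \;=\; \sum_{j} \operatorname{softmax}_j\!\Bigl(\tfrac{q_i^{\top} k_j}{\sqrt{d}}\Bigr)\,\bigl(W_O W_V x_j\bigr)
\]
% A position-wise FFN applied to token i, with w_{2,m} the columns of W_2:
\[
  \operatorname{FFN}(x_i) \;=\; \sum_{m} \sigma\!\bigl(w_{1,m}^{\top} x_i\bigr)\, w_{2,m}
\]
% Both outputs are weighted sums of vectors: the attention scores act like
% hidden activations, and the token-dependent value vectors W_O W_V x_j take
% the place of the FFN's static second-layer columns w_{2,m}.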
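
For readers unfamiliar with the MoE paradigm the abstract builds on, below is a minimal sketch of a token-level top-k routed MoE layer whose experts live in one shared module, so the same expert pool could in principle be invoked from both attention and FFN sites. The class and parameter names (SharedMoE, Expert, d_model, n_experts, top_k) and the PyTorch implementation are illustrative assumptions, not code from the paper.

# Illustrative sketch only (not the authors' code): a token-level top-k routed
# MoE layer backed by one shared expert pool. All names and shapes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A single FFN expert: up-projection, nonlinearity, down-projection."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class SharedMoE(nn.Module):
    """Top-k routed mixture over a shared expert pool.

    The same instance can be invoked from an attention block and an FFN block,
    so both draw on the same experts (the parameter-sharing idea mentioned in
    the abstract); the routing itself is ordinary token-level top-k gating.
    """

    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(n_experts)]
        )
        self.router = nn.Linear(d_model, n_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); flatten tokens for routing.
        batch, seq_len, d_model = x.shape
        flat = x.reshape(-1, d_model)

        # Token-level routing: softmax over experts, keep the top-k per token,
        # and renormalize the kept weights so they sum to one.
        scores = F.softmax(self.router(flat), dim=-1)        # (tokens, n_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)   # (tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            # Tokens whose top-k selection includes expert e.
            token_idx, slot_idx = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot_idx].unsqueeze(-1) * expert(
                flat[token_idx]
            )

        return out.reshape(batch, seq_len, d_model)


if __name__ == "__main__":
    # Tiny smoke test with assumed dimensions.
    moe = SharedMoE(d_model=64, d_hidden=256, n_experts=8, top_k=2)
    tokens = torch.randn(2, 16, 64)
    print(moe(tokens).shape)  # torch.Size([2, 16, 64])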