UMoE: Unifying Attention and FFN with Shared Experts
May 12, 2025
Authors: Yuanhang Yang, Chaozheng Wang, Jing Li
cs.AI
Abstract
Sparse Mixture of Experts (MoE) architectures have emerged as a promising
approach for scaling Transformer models. While initial works primarily
incorporated MoE into feed-forward network (FFN) layers, recent studies have
explored extending the MoE paradigm to attention layers to enhance model
performance. However, existing attention-based MoE layers require specialized
implementations and demonstrate suboptimal performance compared to their
FFN-based counterparts. In this paper, we aim to unify the MoE designs in
attention and FFN layers by introducing a novel reformulation of the attention
mechanism, revealing an underlying FFN-like structure within attention modules.
Our proposed architecture, UMoE, achieves superior performance through
attention-based MoE layers while enabling efficient parameter sharing between
FFN and attention components.
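For orientation only, the sketch below shows a generic top-k routed mixture-of-experts layer over FFN experts in PyTorch. It is not the authors' implementation; every name (TopKSharedExpertMoE, d_model, n_experts, top_k) is illustrative, and the comments only mark where UMoE-style sharing of expert parameters between attention-based and FFN-based MoE layers would plug in.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSharedExpertMoE(nn.Module):
    """Generic top-k routed mixture of FFN experts; a sketch, not the UMoE code."""
    def __init__(self, d_model=256, d_hidden=512, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # These expert weights are the kind of parameters that, per the abstract,
        # could be shared between attention-based and FFN-based MoE layers.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):
        # x: (batch, seq, d_model). In an attention-based MoE layer the input would
        # be the attention-weighted token mixture rather than the raw hidden state.
        scores = self.router(x)                               # (B, S, E) routing logits
        topk_scores, topk_idx = scores.topk(self.top_k, -1)   # top-k experts per token
        gates = F.softmax(topk_scores, dim=-1)                # renormalize chosen gates
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[..., slot]                         # (B, S) expert id per token
            gate = gates[..., slot].unsqueeze(-1)             # (B, S, 1) gate value
            for e, expert in enumerate(self.experts):
                mask = idx == e                               # tokens routed to expert e
                if mask.any():
                    out[mask] = out[mask] + gate[mask] * expert(x[mask])
        return out

# Example: route 2 sequences of 16 tokens through the layer.
layer = TopKSharedExpertMoE()
y = layer(torch.randn(2, 16, 256))   # y has shape (2, 16, 256)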