UMoE: Unifying Attention and FFN with Shared Experts

May 12, 2025
Authors: Yuanhang Yang, Chaozheng Wang, Jing Li
cs.AI

Abstract

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify the MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, revealing an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
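
The abstract does not give implementation details, but the idea of a single expert pool serving both attention and FFN layers can be illustrated with a minimal, hypothetical sketch. Everything below (the `SharedExpertLayer` and `Expert` names, single-head token mixing, `num_experts`, `top_k`, `mix_tokens`) is an illustrative assumption based on the abstract's description, not the paper's actual UMoE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One FFN expert: up-projection, nonlinearity, down-projection."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class SharedExpertLayer(nn.Module):
    """Illustrative layer in which one pool of FFN experts serves two roles:
    applied to attention-mixed tokens (attention mode) or to the tokens
    themselves (FFN mode)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8,
                 top_k: int = 2, mix_tokens: bool = True):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(d_model, d_hidden) for _ in range(num_experts)])
        self.router = nn.Linear(d_model, num_experts)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.top_k = top_k
        self.mix_tokens = mix_tokens

    def forward(self, x):  # x: (batch, seq, d_model)
        if self.mix_tokens:
            # Attention-style token mixing: each position becomes a convex
            # combination of the sequence before the experts are applied.
            scores = self.q_proj(x) @ self.k_proj(x).transpose(-2, -1)
            attn = F.softmax(scores / x.size(-1) ** 0.5, dim=-1)
            h = attn @ x
        else:
            h = x
        # Standard top-k routing over the shared expert pool.
        weights, idx = self.router(h).topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(h)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = idx[..., k] == e  # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] = out[mask] + weights[..., k][mask].unsqueeze(-1) * expert(h[mask])
        return out
```

In this sketch, `mix_tokens=True` feeds the experts attention-mixed tokens (an attention-style MoE layer) while `mix_tokens=False` feeds them the raw tokens (a standard MoE FFN layer), so both layer types draw on the same expert parameters, which is the kind of sharing the abstract describes.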
