UMoE: Unifying Attention and FFN with Shared Experts

May 12, 2025
Authors: Yuanhang Yang, Chaozheng Wang, Jing Li
cs.AI

Abstract

Sparse Mixture of Experts (MoE) architectures have emerged as a promising approach for scaling Transformer models. While initial works primarily incorporated MoE into feed-forward network (FFN) layers, recent studies have explored extending the MoE paradigm to attention layers to enhance model performance. However, existing attention-based MoE layers require specialized implementations and demonstrate suboptimal performance compared to their FFN-based counterparts. In this paper, we aim to unify the MoE designs in attention and FFN layers by introducing a novel reformulation of the attention mechanism, revealing an underlying FFN-like structure within attention modules. Our proposed architecture, UMoE, achieves superior performance through attention-based MoE layers while enabling efficient parameter sharing between FFN and attention components.
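The abstract does not spell out the reformulation, but the core ingredients it names (a token-wise sparse MoE layer and parameter sharing of experts between attention and FFN components) can be illustrated with a minimal sketch. The snippet below is an illustrative assumption, not UMoE's actual design: the class names `Expert` and `SharedExpertMoE`, the top-2 routing, and all dimensions are made up for the example; the only point is that one expert pool can be reused by an MoE layer placed in the attention position and another in the FFN position.

```python
# Minimal sketch (illustrative only): a top-k MoE layer over a pool of small
# FFN experts, where the same expert pool is shared by two MoE layers.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A small two-layer FFN expert (hypothetical structure)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))


class SharedExpertMoE(nn.Module):
    """Token-wise top-k MoE. Passing the same ModuleList of experts to two
    instances (one in the attention position, one in the FFN position)
    shares their expert parameters."""
    def __init__(self, d_model: int, experts: nn.ModuleList, top_k: int = 2):
        super().__init__()
        self.experts = experts                       # shared expert pool
        self.router = nn.Linear(d_model, len(experts))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to individual tokens
        tokens = x.reshape(-1, x.size(-1))
        scores = self.router(tokens)                 # (n_tokens, n_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e         # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


# Usage: one expert pool shared by two MoE layers (hypothetical placement).
experts = nn.ModuleList(Expert(d_model=64, d_hidden=256) for _ in range(8))
attn_side_moe = SharedExpertMoE(64, experts, top_k=2)
ffn_side_moe = SharedExpertMoE(64, experts, top_k=2)
x = torch.randn(2, 16, 64)
y = ffn_side_moe(attn_side_moe(x))
```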
