Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
March 4, 2025
Authors: Yujiao Yang, Jing Lian, Linhui Li
cs.AI
Abstract
Mixture-of-Experts (MoE) enhances model performance while maintaining
computational efficiency, making it well suited for large-scale applications.
However, each expert in existing MoE paradigms works as an individual, and thus
the paradigm lacks high-quality expert interactions. Moreover, MoE has not been
effectively extended to the attention block, which constrains further efficiency
improvements. To tackle these issues, we propose Union-of-Experts (UoE), which
decomposes the transformer into an equivalent group of experts and then
implements dynamic routing over input data and experts. Our approach advances
MoE design with four key innovations: (1) We conduct equivalent expert
decomposition on both MLP blocks and attention blocks, based on matrix
partitioning in tensor parallelism. (2) We develop two routing paradigms,
patch-wise data selection and expert selection, to apply routing at different
levels. (3) We design the architecture of the UoE model, including Selective
Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a
parallel implementation of UoE's routing and computation operations, and
optimize efficiency based on hardware processing analysis. Experiments
demonstrate that models employing UoE surpass Full Attention, state-of-the-art
MoEs, and efficient transformers on several tasks across image and natural
language domains. The source code is available at
https://github.com/YujiaoYang-work/UoE.
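The "equivalent expert decomposition" of an MLP block can be illustrated with a small sketch (our own illustration, not the authors' code): as in tensor parallelism, the first weight matrix is split column-wise and the second row-wise, so each expert owns an independent slice of the hidden state. Because the activation is elementwise, summing the expert outputs reproduces the full MLP exactly; ReLU and the shapes below are illustrative assumptions.

```python
import numpy as np

def full_mlp(x, W1, W2):
    # Standard transformer MLP (biases omitted): y = act(x W1) W2.
    return np.maximum(x @ W1, 0.0) @ W2  # ReLU for simplicity

def expert_mlp(x, W1, W2, n_experts):
    # Column-partition W1 and row-partition W2 into expert shards.
    # Each expert computes an independent slice of the hidden state;
    # summing the expert outputs equals the full MLP output.
    W1_shards = np.split(W1, n_experts, axis=1)
    W2_shards = np.split(W2, n_experts, axis=0)
    return sum(np.maximum(x @ w1, 0.0) @ w2
               for w1, w2 in zip(W1_shards, W2_shards))

rng = np.random.default_rng(0)
d, d_ff, n = 8, 32, 4           # toy dimensions, d_ff divisible by n
x = rng.normal(size=(5, d))
W1 = rng.normal(size=(d, d_ff))
W2 = rng.normal(size=(d_ff, d))
assert np.allclose(full_mlp(x, W1, W2), expert_mlp(x, W1, W2, n))
```

Routing then amounts to evaluating only a selected subset of these shards per input, trading the exact sum for sparsity.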