Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer

Abstract

Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well suited to large-scale applications. However, each expert in existing MoE paradigms works in isolation, so these models lack high-quality interaction between experts. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes a transformer into an equivalent group of experts and then implements dynamic routing over both input data and experts. Our approach advances MoE design with four key innovations: (1) We perform equivalent expert decomposition on both MLP blocks and attention blocks, based on the matrix partitioning used in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing at different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize its efficiency based on an analysis of hardware processing. Experiments demonstrate that models employing UoE surpass Full Attention, state-of-the-art MoEs, and efficient transformers on several tasks across the image and natural language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.
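The idea of an equivalent decomposition can be illustrated with the standard tensor-parallel partition of a two-layer MLP: split the first weight matrix by columns and the second by rows, and the sum of the slice outputs reproduces the dense output, provided the activation is elementwise. The sketch below is a minimal illustration under that assumption, not the authors' implementation. All function names (`mlp`, `split_experts`, `union_of_experts`) are hypothetical, the activation is assumed to be ReLU, and routing is reduced to a plain list of active expert indices standing in for whatever router the paper uses.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Elementwise ReLU; being elementwise is what makes the split exact."""
    return [[max(0.0, v) for v in row] for row in m]


def mlp(x, w1, w2):
    """Dense two-layer MLP: relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


def split_experts(w1, w2, n_experts):
    """Partition w1 by columns and w2 by rows into equal hidden-dimension slices."""
    hidden = len(w2)
    assert hidden % n_experts == 0, "hidden size must divide evenly"
    size = hidden // n_experts
    experts = []
    for e in range(n_experts):
        lo, hi = e * size, (e + 1) * size
        w1_e = [row[lo:hi] for row in w1]  # d_model x size
        w2_e = w2[lo:hi]                   # size x d_model
        experts.append((w1_e, w2_e))
    return experts


def union_of_experts(x, experts, active=None):
    """Sum the outputs of the selected experts (all of them if active is None)."""
    if active is None:
        active = range(len(experts))
    d_out = len(experts[0][1][0])
    out = [[0.0] * d_out for _ in x]
    for e in active:
        w1_e, w2_e = experts[e]
        y_e = mlp(x, w1_e, w2_e)
        out = [[o + y for o, y in zip(ro, ry)] for ro, ry in zip(out, y_e)]
    return out


# Small random check that the full union matches the dense MLP.
random.seed(0)
d_model, hidden, n_tokens = 4, 8, 3
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
w1 = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(d_model)]
w2 = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(hidden)]

dense = mlp(x, w1, w2)
experts = split_experts(w1, w2, n_experts=4)
union = union_of_experts(x, experts)
max_diff = max(abs(a - b) for ra, rb in zip(dense, union) for a, b in zip(ra, rb))
print(f"max |dense - union| = {max_diff:.2e}")
```

With every expert active, the result agrees with the dense MLP up to floating-point rounding. Passing a subset such as `active=[0, 2]` computes only part of the sum, which is where a router would trade accuracy for compute.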
