OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale

Abstract

Mixture-of-Experts (MoE) architectures are evolving toward finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE combines two components: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(√N); and (ii) Expert-Centric Scheduling, which inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. OmniMoE (1.7B active parameters) achieves 50.9% zero-shot accuracy across seven benchmarks, outperforming coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73 ms to 6.7 ms (a 10.9-fold speedup) compared to PEER, demonstrating that massive-scale fine-grained MoE can be both fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
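The abstract does not spell out how the Cartesian Product Router decomposes the index space, so the following is only a minimal sketch of one common way such a decomposition can work, not the authors' implementation. It assumes that the N experts sit on a √N x √N grid and that the score of expert (i, j) is the sum of a row score and a column score (the additive form and the names cartesian_route, brute_force_route, and top_k are assumptions made for illustration). Under that assumption, the global top-k experts always lie among the top-k rows crossed with the top-k columns, so a token needs only 2√N scores plus k² candidate sums instead of N scores.

```python
import heapq
import itertools
import random


def top_k(scores, k):
    """Return the k largest (score, index) pairs, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def cartesian_route(row_scores, col_scores, k):
    """Select k experts on a grid without scoring every grid cell.

    Assumption (not from the paper): expert (i, j) has score
    row_scores[i] + col_scores[j]. Requires k <= len(row_scores)
    and k <= len(col_scores).
    """
    n_cols = len(col_scores)
    best_rows = top_k(row_scores, k)  # k best row halves
    best_cols = top_k(col_scores, k)  # k best column halves
    # Only k * k candidate pairs are combined, not the full grid.
    candidates = (
        (rs + cs, i * n_cols + j)  # flat expert index
        for (rs, i), (cs, j) in itertools.product(best_rows, best_cols)
    )
    return heapq.nlargest(k, candidates)


def brute_force_route(row_scores, col_scores, k):
    """Reference router that scores all experts on the grid."""
    all_scores = [r + c for r in row_scores for c in col_scores]
    return top_k(all_scores, k)


if __name__ == "__main__":
    random.seed(0)
    side = 32  # 32 * 32 = 1024 hypothetical experts
    rows = [random.random() for _ in range(side)]
    cols = [random.random() for _ in range(side)]
    for score, expert in cartesian_route(rows, cols, k=4):
        print(f"expert {expert}: score {score:.3f}")
```

The brute-force function is included only as a reference for checking the decomposed router on small grids; it scores every expert and is exactly the O(N) cost the decomposition avoids.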
