OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale
February 5, 2026
Authors: Jingze Shi, Zhangyang Peng, Yizhang Zhu, Yifan Wu, Guang Liu, Yuyu Luo
cs.AI
Abstract
Mixture-of-Experts (MoE) architectures are evolving towards finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, enabling scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. Although this atomic design maximizes model capacity, it poses severe challenges for routing complexity and memory access. To address these, OmniMoE adopts a system-algorithm co-design: (i) a Cartesian Product Router that decomposes the massive index space, reducing routing complexity from O(N) to O(√N); and (ii) Expert-Centric Scheduling, which inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. Across seven zero-shot benchmarks, OmniMoE (1.7B active parameters) achieves 50.9% accuracy, outperforming both coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73 ms to 6.7 ms (a 10.9x speedup) compared with PEER, demonstrating that massive-scale fine-grained MoE can be both fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
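To make the complexity claim concrete, the sketch below illustrates the general idea behind product-style routing as described in the abstract: instead of scoring all N = n x n atomic experts, two sub-routers each score only n = √N candidates per axis, and the final expert index is a pair drawn from the Cartesian product of the two shortlists. This is a minimal, hypothetical reconstruction; the function and parameter names are illustrative and do not come from the OmniMoE codebase.

```python
import torch

def cartesian_product_route(x, key_a, key_b, top_k=8):
    """Illustrative product routing for one token (not the official OmniMoE code).

    x      : (d,)    token hidden state
    key_a  : (n, d)  sub-router keys for the first index axis
    key_b  : (n, d)  sub-router keys for the second index axis
    Selects top_k experts out of n * n, scoring only 2n keys.
    """
    scores_a = key_a @ x            # (n,) scores along the first axis
    scores_b = key_b @ x            # (n,) scores along the second axis
    ka = min(top_k, scores_a.numel())
    va, ia = scores_a.topk(ka)      # per-axis shortlist
    vb, ib = scores_b.topk(ka)
    # Assume the score of expert (i, j) is additive: s_a[i] + s_b[j].
    # Then only the ka * ka shortlisted pairs need to be materialized.
    pair_scores = va[:, None] + vb[None, :]          # (ka, ka)
    flat = pair_scores.flatten().topk(top_k)
    rows, cols = flat.indices // ka, flat.indices % ka
    n = key_a.shape[0]
    expert_ids = ia[rows] * n + ib[cols]             # flat id in [0, n*n)
    return expert_ids, flat.values

torch.manual_seed(0)
x = torch.randn(64)
ids, scores = cartesian_product_route(x, torch.randn(32, 64), torch.randn(32, 64))
print(ids.shape)  # top_k expert ids chosen from 32 * 32 = 1024 atomic experts
```

Under the additive-score assumption, routing cost per token is O(√N) key comparisons rather than O(N), which is the decomposition the abstract attributes to the Cartesian Product Router.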