OmniMoE: An Efficient MoE by Orchestrating Atomic Experts at Scale

Abstract

Mixture-of-Experts (MoE) architectures are evolving toward finer granularity to improve parameter efficiency. However, existing MoE designs face an inherent trade-off between the granularity of expert specialization and hardware execution efficiency. We propose OmniMoE, a system-algorithm co-designed framework that pushes expert granularity to its logical extreme. OmniMoE introduces vector-level Atomic Experts, which enable scalable routing and execution within a single MoE layer, while retaining a shared dense MLP branch for general-purpose processing. This atomic design maximizes capacity, but it poses severe challenges for routing complexity and memory access. To address them, OmniMoE combines two components: (i) a Cartesian Product Router that decomposes the massive index space to reduce routing complexity from O(N) to O(√N), where N is the number of experts; and (ii) Expert-Centric Scheduling, which inverts the execution order to turn scattered, memory-bound lookups into efficient dense matrix operations. With 1.7B active parameters, OmniMoE achieves an average zero-shot accuracy of 50.9% across seven benchmarks, outperforming both coarse-grained (e.g., DeepSeekMoE) and fine-grained (e.g., PEER) baselines. Crucially, OmniMoE reduces inference latency from 73 ms to 6.7 ms compared to PEER (a 10.9× speedup), demonstrating that massive-scale fine-grained MoE can be both fast and accurate. Our code is open-sourced at https://github.com/flash-algo/omni-moe.
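The two mechanisms named in the abstract can be illustrated with a small, pure-Python sketch. It is not taken from the OmniMoE repository and makes several assumptions that the abstract does not state: the N experts are laid out on an n × n grid (n = √N), an expert's score is the sum of a row score and a column score, and the function names `cartesian_route` and `group_tokens_by_expert` are hypothetical. Under those assumptions, the router only scans 2n scores plus k² candidate pairs instead of all N experts, and the scheduler inverts token-to-expert assignments so that each expert sees all of its tokens as one batch.

```python
import heapq
import random
from collections import defaultdict


def top_k(scores, k):
    """Return the k highest (score, index) pairs, best first."""
    return heapq.nlargest(k, ((s, i) for i, s in enumerate(scores)))


def cartesian_route(row_scores, col_scores, k):
    """Select the top-k experts on an n x n grid without scoring all n * n.

    Assumption of this sketch: expert (r, c) has flat index r * n + c and
    score row_scores[r] + col_scores[c].
    """
    n = len(col_scores)
    best_rows = top_k(row_scores, k)  # one O(n) scan, n = sqrt(N)
    best_cols = top_k(col_scores, k)  # one O(n) scan
    # Only k * k candidate pairs are combined, independent of N.
    candidates = [
        (rs + cs, r * n + c) for rs, r in best_rows for cs, c in best_cols
    ]
    return heapq.nlargest(k, candidates)


def brute_force_route(row_scores, col_scores, k):
    """Reference O(N) router that scores every expert; for checking only."""
    n = len(col_scores)
    full = [
        (rs + cs, r * n + c)
        for r, rs in enumerate(row_scores)
        for c, cs in enumerate(col_scores)
    ]
    return heapq.nlargest(k, full)


def group_tokens_by_expert(token_routes):
    """Invert token -> experts lists into expert -> tokens lists.

    Each expert can then process its tokens as one contiguous batch
    instead of every token gathering its experts one by one.
    """
    buckets = defaultdict(list)
    for token_id, experts in enumerate(token_routes):
        for expert_id in experts:
            buckets[expert_id].append(token_id)
    return dict(buckets)


# Usage: route two toy tokens over N = 64 * 64 experts, then regroup.
random.seed(0)
n = 64
routes = []
for _ in range(2):
    rows = [random.random() for _ in range(n)]
    cols = [random.random() for _ in range(n)]
    routes.append([idx for _, idx in cartesian_route(rows, cols, k=4)])
batches = group_tokens_by_expert(routes)
```

The factored search returns the same experts as the brute-force scan whenever the row scores are distinct and the column scores are distinct, because any expert in the global top-k must lie in a top-k row and a top-k column. In a real kernel the per-expert token lists would feed batched matrix multiplications; here they are plain Python lists.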

