DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing, in which exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism in which the number of active experts per token varies with input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10 (image classification) and on Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that optimal expert schedules are task- and scale-dependent: descending schedules (which concentrate capacity in early layers) outperform uniform baselines on image classification. For language modeling, the optimal schedule varies with model size: descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks and provides principled guidance for MoE architecture design.
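
The two ideas named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not state the routing rule or the schedule formulas, so everything here is an assumption made for illustration: the cumulative-probability cutoff in `dynamic_route`, the `threshold` and `max_k` parameters, and the linear, triangular, and sinusoidal shapes in `expert_schedule`. Only the schedule names (descending, ascending, pyramid, wave, uniform) come from the text, and the sixth strategy is omitted because the abstract does not name it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_route(logits, threshold=0.7, max_k=None):
    """Select a variable number of experts for one token.

    Hypothetical rule (an assumption, not taken from the source): keep
    adding experts in order of router probability until their combined
    probability mass reaches `threshold`, or `max_k` experts are chosen.
    Returns the chosen expert indices and their renormalized weights.
    """
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= threshold or (max_k is not None and len(chosen) >= max_k):
            break
    weights = [probs[i] / mass for i in chosen]
    return chosen, weights


def expert_schedule(kind, num_layers, min_experts, max_experts):
    """Number of experts per layer for a named schedule.

    The shape functions below are illustrative assumptions; the source
    only names the patterns.
    """
    span = max_experts - min_experts
    counts = []
    for layer in range(num_layers):
        # Normalized depth in [0, 1]; 0 is the first layer.
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        if kind == "uniform":
            frac = 0.5
        elif kind == "descending":  # most capacity in early layers
            frac = 1.0 - t
        elif kind == "ascending":  # most capacity in late layers
            frac = t
        elif kind == "pyramid":  # most capacity in the middle layers
            frac = 1.0 - abs(2.0 * t - 1.0)
        elif kind == "wave":  # one sinusoidal period over depth
            frac = 0.5 * (1.0 + math.sin(2.0 * math.pi * t))
        else:
            raise ValueError(f"unknown schedule: {kind}")
        counts.append(min_experts + round(frac * span))
    return counts
```

Under this assumed rule, a token whose router distribution is sharply peaked activates a single expert, while a token with a flat distribution activates several. This is one simple way to make the active-expert count depend on the input, in contrast to fixed Top-K routing.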

