

DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks

March 2, 2026
Author: Gökdeniz Gülmez
cs.AI

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing where exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE provides a principled routing mechanism in which the number of active experts per token varies based on input complexity. Concurrently, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, CIFAR-10 (image classification), and Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that optimal expert schedules are task- and scale-dependent: descending schedules (concentrating capacity in early layers) outperform uniform baselines on image classification, while for language modeling the optimal schedule varies by model size: descending for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks, providing principled guidance for MoE architecture design.
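To make the two mechanisms described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of (a) a layer-wise expert-count schedule covering the named patterns (uniform, descending, ascending, pyramid, wave) and (b) a threshold-based router in which each token activates only as many experts as needed to cover a gate-probability mass tau, capped at max_k. The names expert_schedule, DynamicRouter, tau, and max_k are illustrative assumptions, not the paper's actual API; the abstract does not specify the exact routing criterion or the two remaining schedule types.

```python
# Illustrative sketch only (assumed PyTorch setting); not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def expert_schedule(pattern: str, num_layers: int, min_experts: int, max_experts: int):
    """Assign an expert count to each layer according to a named schedule."""
    pos = [l / max(num_layers - 1, 1) for l in range(num_layers)]
    if pattern == "uniform":          # same capacity at every depth
        vals = [0.5] * num_layers
    elif pattern == "descending":     # concentrate capacity in early layers
        vals = [1.0 - p for p in pos]
    elif pattern == "ascending":      # concentrate capacity in late layers
        vals = pos
    elif pattern == "pyramid":        # peak capacity in the middle layers
        vals = [1.0 - abs(2.0 * p - 1.0) for p in pos]
    elif pattern == "wave":           # alternate between low and high capacity
        vals = [float(l % 2) for l in range(num_layers)]
    else:
        raise ValueError(f"unknown schedule: {pattern}")
    return [round(min_experts + v * (max_experts - min_experts)) for v in vals]


class DynamicRouter(nn.Module):
    """Per-token dynamic routing: each token activates the smallest prefix of
    experts (sorted by gate probability) whose cumulative mass reaches tau,
    instead of a fixed Top-K."""

    def __init__(self, d_model: int, num_experts: int, tau: float = 0.6, max_k: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.tau = tau
        self.max_k = max_k

    def forward(self, x: torch.Tensor):
        probs = F.softmax(self.gate(x), dim=-1)                    # (tokens, E)
        sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
        mass_before = sorted_p.cumsum(dim=-1) - sorted_p           # mass covered before each expert
        keep = mass_before < self.tau                              # smallest prefix reaching tau
        keep &= torch.arange(probs.size(-1), device=x.device) < self.max_k
        weights = torch.where(keep, sorted_p, torch.zeros_like(sorted_p))
        weights = weights / weights.sum(dim=-1, keepdim=True)      # renormalise the kept experts
        return sorted_idx, weights, keep.sum(dim=-1)               # expert ids, weights, per-token K


# Example: six layers with a descending schedule, and per-token K that varies.
print(expert_schedule("descending", num_layers=6, min_experts=2, max_experts=8))
router = DynamicRouter(d_model=64, num_experts=8)
ids, w, k_per_token = router(torch.randn(10, 64))
print(k_per_token)  # number of active experts chosen for each of the 10 tokens
```

In this reading, a descending schedule allocates the largest expert counts to the earliest layers, and tau controls how sharply the per-token expert count K responds to how concentrated or diffuse a token's gate distribution is.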