DynaMoE: Dynamic Token-Level Expert Activation with Layer-Wise Adaptive Capacity for Mixture-of-Experts Neural Networks

Abstract

Mixture-of-Experts (MoE) architectures have emerged as a powerful paradigm for scaling neural networks while maintaining computational efficiency. However, standard MoE implementations rely on two rigid design assumptions: (1) fixed Top-K routing, in which exactly K experts are activated per token, and (2) uniform expert allocation across all layers. This paper introduces DynaMoE, a novel MoE framework that relaxes both constraints through dynamic token-level expert activation and layer-wise adaptive capacity allocation. DynaMoE introduces a principled routing mechanism in which the number of active experts per token varies with input complexity. In parallel, the framework implements six distinct scheduling strategies for distributing expert capacity across network depth, including descending, ascending, pyramid, and wave patterns. We theoretically analyze the expressivity gains of dynamic routing and derive bounds on computational efficiency. Through extensive experiments on MNIST, Fashion-MNIST, and CIFAR-10 (image classification) and on Recycling-the-Web (language modeling) across multiple model scales, we demonstrate that DynaMoE achieves superior parameter efficiency compared to static baselines. Our key finding is that the optimal expert schedule depends on both task and scale. On image classification, descending schedules (which concentrate capacity in early layers) outperform uniform baselines. For language modeling, the optimal schedule varies with model size: descending was best for Tiny, ascending for Small, and uniform for Medium. Furthermore, dynamic routing reduces gradient variance during training, leading to improved convergence stability. DynaMoE establishes a new framework for adaptive computation in neural networks and provides principled guidance for MoE architecture design.
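
To make the two ideas in the abstract concrete, the following is a minimal sketch in plain Python, not the paper's implementation. The abstract does not specify the routing rule or the schedule formulas, so everything here is an assumption for illustration: `dynamic_route` uses a cumulative-probability threshold (a token keeps adding experts until the selected router mass reaches `threshold`), and `expert_schedule` uses simple hypothetical shapes for the named descending, ascending, pyramid, and wave patterns, plus a uniform baseline. All function names, parameters, and default values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_route(logits, threshold=0.7, max_k=None):
    """Return indices of the experts activated for one token.

    Assumed rule (not from the paper): take experts in order of decreasing
    router probability until their cumulative mass reaches `threshold`.
    Confident tokens activate few experts; ambiguous tokens activate more.
    """
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        chosen.append(i)
        mass += probs[i]
        if mass >= threshold or (max_k is not None and len(chosen) >= max_k):
            break
    return chosen


def expert_schedule(kind, num_layers, min_experts=2, max_experts=8):
    """Return a per-layer expert count for a named schedule.

    The shapes below are illustrative guesses at what each name means.
    """
    span = max_experts - min_experts
    counts = []
    for layer in range(num_layers):
        # t is the relative depth of the layer in [0, 1].
        t = layer / (num_layers - 1) if num_layers > 1 else 0.0
        if kind == "uniform":
            frac = 0.5
        elif kind == "descending":   # most capacity in early layers
            frac = 1.0 - t
        elif kind == "ascending":    # most capacity in late layers
            frac = t
        elif kind == "pyramid":      # assumed: peak in the middle layers
            frac = 1.0 - abs(2.0 * t - 1.0)
        elif kind == "wave":         # assumed: two cosine cycles over depth
            frac = 0.5 * (1.0 + math.cos(4.0 * math.pi * t))
        else:
            raise ValueError(f"unknown schedule: {kind}")
        counts.append(min_experts + round(span * frac))
    return counts


# A confident token activates one expert; a maximally ambiguous one needs three.
print(dynamic_route([5.0, 0.0, 0.0, 0.0]))   # [0]
print(dynamic_route([0.0, 0.0, 0.0, 0.0]))   # [0, 1, 2]
print(expert_schedule("descending", 4))      # [8, 6, 4, 2]
```

Under this assumed rule, per-token compute scales with router uncertainty, while the schedule fixes how many experts each layer has available to route over. Any real comparison would need the paper's actual definitions.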
