UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

Abstract

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
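
The shared-pool idea described above can be illustrated with a toy sketch in plain Python. It only shows the structural point: one global list of experts, an independent router per layer, and balance statistics aggregated over the whole pool rather than per layer. Several parts are assumptions and not taken from the paper: the class name `SharedPoolMoE`, the use of a plain softmax top-k router as a stand-in (NormRouter is not reproduced here, since the abstract does not specify it), and the Switch-Transformer-style form of the balance term, which is simply computed on pooled statistics.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(mat, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in mat]


class SharedPoolMoE:
    """Toy model (hypothetical): every layer routes into one shared expert pool."""

    def __init__(self, num_layers, pool_size, dim, top_k, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

        # One global pool; each "expert" is a single dim x dim linear map.
        self.experts = [init(dim, dim) for _ in range(pool_size)]
        # One independent router per layer, each scoring every pool expert.
        self.routers = [init(pool_size, dim) for _ in range(num_layers)]
        self.pool_size = pool_size
        self.top_k = top_k

    def forward(self, tokens):
        counts = [0] * self.pool_size       # assignments per expert, all layers
        prob_mass = [0.0] * self.pool_size  # summed router probability, all layers
        outputs = []
        for x in tokens:
            h = list(x)
            for router in self.routers:
                # Assumption: softmax top-k routing stands in for NormRouter.
                probs = softmax(matvec(router, h))
                ranked = sorted(range(self.pool_size), key=lambda e: probs[e], reverse=True)
                chosen = ranked[: self.top_k]
                norm = sum(probs[e] for e in chosen)
                update = [0.0] * len(h)
                for e in chosen:
                    y = matvec(self.experts[e], h)
                    gate = probs[e] / norm
                    update = [u + gate * yi for u, yi in zip(update, y)]
                    counts[e] += 1
                for e in range(self.pool_size):
                    prob_mass[e] += probs[e]
                h = [hi + ui for hi, ui in zip(h, update)]  # residual connection
            outputs.append(h)
        return outputs, counts, self.pool_balance(counts, prob_mass, len(tokens))

    def pool_balance(self, counts, prob_mass, num_tokens):
        # Assumed form: pool_size * sum_e(assignment fraction * mean router prob),
        # with both statistics pooled over all layers instead of kept per layer.
        n_routes = num_tokens * len(self.routers)
        frac = [c / (n_routes * self.top_k) for c in counts]
        mean_prob = [p / n_routes for p in prob_mass]
        return self.pool_size * sum(f * p for f, p in zip(frac, mean_prob))


if __name__ == "__main__":
    rng = random.Random(1)
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(16)]
    # Hypothetical sizes: 6 layers share 8 experts instead of owning 8 each.
    model = SharedPoolMoE(num_layers=6, pool_size=8, dim=4, top_k=2)
    _, counts, balance = model.forward(tokens)
    print("assignments per expert:", counts)
    print("pool-level balance term:", balance)
```

In this sketch the number of experts is set by `pool_size` alone and does not depend on `num_layers`, which mirrors the abstract's point that pool size becomes an explicit hyperparameter that need not grow linearly with depth.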
