

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

May 7, 2026
Authors: Minbin Huang, Han Shi, Chuanyang Zheng, Yimeng Wu, Guoxuan Chen, Xintong Yu, Yichun Yin, Hong Cheng
cs.AI

Abstract

Modern Mixture-of-Experts (MoE) architectures allocate expert capacity through a rigid per-layer rule: each transformer layer owns a separate expert set. This convention couples depth scaling with linear expert-parameter growth and assumes that every layer needs isolated expert capacity. However, recent analyses and our routing probe challenge this allocation rule: replacing a deeper layer's learned top-k router with uniform random routing drops downstream accuracy by only 1.0-1.6 points across multiple production MoE models. Motivated by this redundancy, we propose UniPool, an MoE architecture that treats expert capacity as a global architectural budget by replacing per-layer expert ownership with a single shared pool accessed by independent per-layer routers. To enable stable and balanced training under sharing, we introduce a pool-level auxiliary loss that balances expert utilization across the entire pool, and adopt NormRouter to provide sparse and scale-stable routing into the shared expert pool. Across five LLaMA-architecture model scales (182M, 469M, 650M, 830M, and 978M parameters) trained on 30B tokens from the Pile, UniPool consistently improves validation loss and perplexity over the matched vanilla MoE baselines. Across these scales, UniPool reduces validation loss by up to 0.0386 relative to vanilla MoE. Beyond raw loss improvement, our results identify pool size as an explicit depth-scaling hyperparameter: reduced-pool UniPool variants using only 41.6%-66.7% of the vanilla expert-parameter budget match or outperform layer-wise MoE at the tested scales. This shows that, under a shared-pool design, expert parameters need not grow linearly with depth; they can grow sublinearly while remaining more efficient and effective than vanilla MoE. Further analysis shows that UniPool's benefits compose with finer-grained expert decomposition.
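For readers who want a concrete picture of the mechanism the abstract describes, the sketch below is one plausible PyTorch rendering of a globally shared expert pool accessed by independent per-layer routers, with a normalized ("NormRouter"-style) scoring rule and a pool-level load-balancing auxiliary loss. The class and argument names (UniPoolMoE, pool_size, top_k) are illustrative, and both the normalized routing and the auxiliary-loss form are interpretations of the abstract, not the paper's exact equations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A single feed-forward expert; the exact FFN form (SiLU MLP) is an assumption."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class UniPoolMoE(nn.Module):
    """One global expert pool shared by all layers; each layer keeps its own router."""

    def __init__(self, d_model: int, d_hidden: int, pool_size: int,
                 num_layers: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # The pool is created once, so depth no longer multiplies expert parameters.
        self.experts = nn.ModuleList(
            Expert(d_model, d_hidden) for _ in range(pool_size)
        )
        # Independent per-layer routers into the same shared pool.
        self.routers = nn.ModuleList(
            nn.Linear(d_model, pool_size, bias=False) for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor, layer_idx: int):
        # x: (num_tokens, d_model)
        router = self.routers[layer_idx]
        # "NormRouter"-style scoring: normalize tokens and router rows so the
        # logit scale stays comparable across layers (our reading of the abstract).
        logits = F.normalize(x, dim=-1) @ F.normalize(router.weight, dim=-1).t()
        probs = logits.softmax(dim=-1)
        gate, idx = probs.topk(self.top_k, dim=-1)  # (num_tokens, top_k)
        gate = gate / gate.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += gate[mask, k].unsqueeze(-1) * expert(x[mask])

        # Pool-level auxiliary loss: push utilization toward uniform over the
        # whole shared pool, in the spirit of standard MoE load balancing.
        num_experts = len(self.experts)
        counts = torch.bincount(idx.reshape(-1), minlength=num_experts).float()
        frac_routed = counts / counts.sum()
        mean_prob = probs.mean(dim=0)
        aux_loss = num_experts * (frac_routed * mean_prob).sum()
        return out, aux_loss
```

In a full model, every transformer layer would call this module with its own layer_idx, and the per-layer aux_loss terms would be accumulated into a single pool-level balancing term added to the language-modeling loss.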