

Layerwise Recurrent Router for Mixture-of-Experts

August 13, 2024
Authors: Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, Jie Fu
cs.AI

Abstract

The scaling of large language models (LLMs) has revolutionized their capabilities in various tasks, yet this growth must be matched with efficient computational strategies. The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Despite their advantages, current MoE models often display parameter inefficiency. For instance, a pre-trained MoE-based LLM with 52 billion parameters might perform comparably to a standard model with 6.7 billion parameters. As a crucial part of MoE, current routers in different layers independently assign tokens without leveraging historical routing information, potentially leading to suboptimal token-expert combinations and the parameter inefficiency problem. To alleviate this issue, we introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE). RMoE leverages a Gated Recurrent Unit (GRU) to establish dependencies between routing decisions across consecutive layers. Such layerwise recurrence can be efficiently computed in parallel for input tokens and introduces negligible costs. Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures. Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity. Our code is at https://github.com/qiuzh20/RMoE.
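To make the mechanism concrete, below is a minimal PyTorch sketch of a GRU-based layerwise recurrent router as described in the abstract: each layer's router updates a per-token hidden state with a GRU cell and derives its routing logits from that state, so routing decisions in consecutive layers are linked. The class, parameter names (`RecurrentRouter`, `d_router`), and the top-2 gating choice are illustrative assumptions, not the authors' actual implementation; see https://github.com/qiuzh20/RMoE for their code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentRouter(nn.Module):
    """One MoE router whose logits depend on a GRU hidden state
    carried over from the previous layer's routing step (sketch)."""

    def __init__(self, d_model: int, d_router: int, num_experts: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, d_router)   # project tokens into router space
        self.gru = nn.GRUCell(d_router, d_router)     # recurrence across layers, per token
        self.gate = nn.Linear(d_router, num_experts)  # routing logits over experts

    def forward(self, x: torch.Tensor, h_prev: torch.Tensor):
        # x: (num_tokens, d_model); h_prev: (num_tokens, d_router) from the previous layer
        h = self.gru(self.proj_in(x), h_prev)         # update the router state for each token
        probs = F.softmax(self.gate(h), dim=-1)       # expert-selection probabilities
        return probs, h


if __name__ == "__main__":
    # Toy usage: route the same tokens through several layers,
    # passing the router hidden state from layer to layer.
    d_model, d_router, num_experts, num_layers, num_tokens = 64, 32, 8, 4, 10
    routers = nn.ModuleList(
        RecurrentRouter(d_model, d_router, num_experts) for _ in range(num_layers)
    )
    x = torch.randn(num_tokens, d_model)
    h = torch.zeros(num_tokens, d_router)             # initial router state
    for router in routers:
        probs, h = router(x, h)                       # h carries cross-layer routing information
        top2 = probs.topk(2, dim=-1).indices          # pick top-2 experts per token (assumed gating)
```

Because the recurrence runs over layers rather than over the token sequence, every token's GRU update within a layer is independent and can be computed in parallel, which is why the added cost stays small.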
