μ-Parametrization for Mixture of Experts
August 13, 2025
Authors: Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski
cs.AI
Abstract
Recent years have seen growing interest in and adoption of LLMs, with μTransfer becoming a key technique for tuning hyperparameters in large-scale training. Meanwhile, Mixture-of-Experts (MoE) has emerged as a leading architecture in extremely large models. However, the intersection of these two advances has remained unexplored. In this work, we derive a μ-Parametrization (μP) for MoE, providing theoretical guarantees for feature learning across model widths in both the router and the experts. We empirically validate our parametrization and further investigate how scaling the number of experts and the granularity affects the optimal learning rate.
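As a rough illustration of what width-aware parametrization looks like in practice, the sketch below applies the standard μP rules for Adam (initialization variance and learning rate scaled by 1/fan_in for hidden weights) to a toy MoE layer. This is a minimal sketch under stated assumptions, not the paper's recipe: the class MoEFeedForward and the helper mup_param_groups are hypothetical names introduced here, and treating the router like an ordinary hidden layer is an assumption of this sketch rather than the parametrization derived in the paper.

```python
# Minimal sketch of muP-style scaling for an MoE feed-forward layer.
# Assumption: the router is scaled like a hidden layer; the paper derives
# the actual rule, which may differ.

import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_expert: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.w_in = nn.Parameter(torch.empty(n_experts, d_model, d_expert))
        self.w_out = nn.Parameter(torch.empty(n_experts, d_expert, d_model))
        # muP-style init: standard deviation proportional to 1/sqrt(fan_in).
        nn.init.normal_(self.w_in, std=d_model ** -0.5)
        nn.init.normal_(self.w_out, std=d_expert ** -0.5)
        nn.init.normal_(self.router.weight, std=d_model ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Top-1 routing for brevity; real MoE layers use top-k with load balancing.
        gates = torch.softmax(self.router(x), dim=-1)   # (batch, n_experts)
        expert_idx = gates.argmax(dim=-1)               # (batch,)
        out = torch.zeros_like(x)
        for e in range(self.w_in.shape[0]):
            mask = expert_idx == e
            if mask.any():
                h = torch.relu(x[mask] @ self.w_in[e])
                out[mask] = (h @ self.w_out[e]) * gates[mask][:, e:e + 1]
        return out


def mup_param_groups(model: MoEFeedForward, base_lr: float,
                     d_model: int, base_d_model: int):
    """Per-parameter Adam learning rates scaled by base_width / width,
    following the standard muP rule for hidden weights.
    Assumption: the router uses the same scaling as the experts."""
    width_mult = d_model / base_d_model
    return [
        {"params": [model.w_in, model.w_out], "lr": base_lr / width_mult},
        {"params": list(model.router.parameters()), "lr": base_lr / width_mult},
    ]


# Usage: tune base_lr on a narrow proxy model, then reuse it at larger widths.
model = MoEFeedForward(d_model=512, d_expert=2048, n_experts=8)
opt = torch.optim.Adam(mup_param_groups(model, base_lr=1e-3,
                                        d_model=512, base_d_model=128))
```

The intended workflow mirrors μTransfer: the base learning rate is tuned once at the small base width (here a hypothetical base_d_model of 128) and then transferred unchanged to wider models, with the per-group scaling absorbing the width difference.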