μ-Parametrization for Mixture of Experts
August 13, 2025
Authors: Jan Małaśnicki, Kamil Ciebiera, Mateusz Boruń, Maciej Pióro, Jan Ludziejewski, Maciej Stefaniak, Michał Krutul, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jakub Krajewski
cs.AI
Abstract
Recent years have seen growing interest in and adoption of LLMs, with μTransfer becoming a key technique for tuning hyperparameters in large-scale training. Meanwhile, Mixture-of-Experts (MoE) has emerged as a leading architecture in extremely large models. However, the intersection of these two advances has remained unexplored. In this work, we derive a μ-Parametrization (μP) for MoE, providing theoretical guarantees for feature learning across model widths in both the router and the experts. We empirically validate our parametrization and further investigate how scaling the number of experts and the granularity affects the optimal learning rate.
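As a rough illustration of what width-aware parametrization looks like in practice, the sketch below applies the standard μP rules for Adam (initialization variance and learning rate scaled by 1/fan_in for hidden weights) to a toy MoE layer. This is a minimal sketch under stated assumptions, not the paper's recipe: the class MoEFeedForward and the helper mup_param_groups are hypothetical names introduced here, and treating the router like an ordinary hidden layer is an assumption of this sketch rather than the parametrization derived in the paper.

```python
# Minimal sketch of muP-style scaling for an MoE feed-forward layer.
# Assumption: the router is scaled like a hidden layer; the paper derives
# the actual rule, which may differ.

import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_expert: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.w_in = nn.Parameter(torch.empty(n_experts, d_model, d_expert))
        self.w_out = nn.Parameter(torch.empty(n_experts, d_expert, d_model))
        # muP-style init: standard deviation proportional to 1/sqrt(fan_in).
        nn.init.normal_(self.w_in, std=d_model ** -0.5)
        nn.init.normal_(self.w_out, std=d_expert ** -0.5)
        nn.init.normal_(self.router.weight, std=d_model ** -0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Top-1 routing for brevity; real MoE layers use top-k with load balancing.
        gates = torch.softmax(self.router(x), dim=-1)   # (batch, n_experts)
        expert_idx = gates.argmax(dim=-1)               # (batch,)
        out = torch.zeros_like(x)
        for e in range(self.w_in.shape[0]):
            mask = expert_idx == e
            if mask.any():
                h = torch.relu(x[mask] @ self.w_in[e])
                out[mask] = (h @ self.w_out[e]) * gates[mask][:, e:e + 1]
        return out


def mup_param_groups(model: MoEFeedForward, base_lr: float,
                     d_model: int, base_d_model: int):
    """Per-parameter Adam learning rates scaled by base_width / width,
    following the standard muP rule for hidden weights.
    Assumption: the router uses the same scaling as the experts."""
    width_mult = d_model / base_d_model
    return [
        {"params": [model.w_in, model.w_out], "lr": base_lr / width_mult},
        {"params": list(model.router.parameters()), "lr": base_lr / width_mult},
    ]


# Usage: tune base_lr on a narrow proxy model, then reuse it at larger widths.
model = MoEFeedForward(d_model=512, d_expert=2048, n_experts=8)
opt = torch.optim.Adam(mup_param_groups(model, base_lr=1e-3,
                                        d_model=512, base_d_model=128))
```

The intended workflow mirrors μTransfer: the base learning rate is tuned once at the small base width (here a hypothetical base_d_model of 128) and then transferred unchanged to wider models, with the per-group scaling absorbing the width difference.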