μ-Parametrization for Mixture of Experts

Abstract

Recent years have seen growing interest in, and adoption of, large language models (LLMs), with muTransfer becoming a key technique for tuning hyperparameters in large-scale training. Meanwhile, Mixture-of-Experts (MoE) has emerged as a leading architecture for extremely large models. However, the intersection of these two advancements has remained unexplored. In this work, we derive a mu-Parameterization (muP) for MoE, providing theoretical guarantees for feature learning across model widths in both the router and the experts. We empirically validate our parameterization and further investigate how scaling the number of experts and the granularity affects the optimal learning rate.
