μ-Parametrization for Mixture of Experts

Abstract

Recent years have seen growing interest in and adoption of large language models (LLMs), with muTransfer becoming a key technique for tuning hyperparameters in large-scale training. Meanwhile, Mixture-of-Experts (MoE) has emerged as a leading architecture for extremely large models. However, the intersection of these two advances has remained unexplored. In this work, we derive a μ-Parameterization (μP) for MoE, providing theoretical guarantees for feature learning across model widths in both the router and the experts. We empirically validate our parameterization and further investigate how scaling the number of experts and the granularity affects the optimal learning rate.
