GRIN: GRadient-INformed MoE

September 18, 2024
Authors: Liyuan Liu, Young Jin Kim, Shuohang Wang, Chen Liang, Yelong Shen, Hao Cheng, Xiaodong Liu, Masahiro Tanaka, Xiaoxia Wu, Wenxiang Hu, Vishrav Chaudhary, Zeqi Lin, Chenruidong Zhang, Jilong Xue, Hany Awadalla, Jianfeng Gao, Weizhu Chen
cs.AI

Abstract

Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing, selectively activating only a small subset of expert modules. However, sparse computation challenges traditional training practices, as discrete expert routing hinders standard backpropagation and thus gradient-based optimization, which are the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
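
The key technical point in the abstract is estimating gradients through the discrete top-2 routing decision, so the router can be trained with standard backpropagation instead of being treated as a non-differentiable step. The minimal PyTorch sketch below illustrates that general idea with a generic straight-through estimator; it is not the estimator used in GRIN, and the module name Top2GateSTE and all sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2GateSTE(nn.Module):
    """Illustrative top-2 expert gate with a straight-through gradient estimator.

    NOTE: this is a generic sketch of differentiable discrete routing,
    not GRIN's own gradient estimator.
    """

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -> routing logits: (tokens, num_experts)
        logits = self.w_gate(x)
        probs = F.softmax(logits, dim=-1)

        # Hard top-2 selection: discrete and, by itself, non-differentiable.
        top2_vals, top2_idx = probs.topk(2, dim=-1)
        hard = torch.zeros_like(probs).scatter(-1, top2_idx, top2_vals)

        # Straight-through trick: the forward pass uses the sparse gate weights,
        # while the backward pass takes gradients through the dense softmax.
        gates = hard + probs - probs.detach()
        return gates, top2_idx

if __name__ == "__main__":
    gate = Top2GateSTE(d_model=64, num_experts=16)
    tokens = torch.randn(8, 64, requires_grad=True)
    gates, idx = gate(tokens)
    gates.sum().backward()  # gradients reach the router despite discrete routing
    print(idx.shape, gate.w_gate.weight.grad is not None)

In this sketch the forward pass applies hard, sparse top-2 gate weights, while the backward pass routes gradients through the dense softmax probabilities, so the gating network still receives a learning signal. GRIN's contribution is a more principled sparse gradient estimator for this routing step, combined with a model-parallelism configuration that avoids dropping tokens.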
