Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models
June 3, 2024
Authors: Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou
cs.AI
Abstract
In this technical report, we introduce the training methodologies implemented
in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE)
large language model (LLM) with 146 billion parameters and 16 experts. It is
initialized from the pre-existing dense checkpoints of our Skywork-13B model.
We explore the comparative effectiveness of upcycling versus training from
scratch initializations. Our findings suggest that the choice between these two
approaches should consider both the performance of the existing dense
checkpoints and the MoE training budget. We highlight two innovative
techniques: gating logit normalization, which improves expert diversification,
and adaptive auxiliary loss coefficients, allowing for layer-specific
adjustment of auxiliary loss coefficients. Our experimental results validate
the effectiveness of these methods. Leveraging these techniques and insights,
we trained our upcycled Skywork-MoE on a condensed subset of our SkyPile
corpus. The evaluation results demonstrate that our model delivers strong
performance across a wide range of benchmarks.
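
The abstract only names the two techniques, so the following is a minimal sketch of how a gate with logit normalization and a layer-specific auxiliary loss coefficient update might look, assuming a standard top-k softmax router implemented in PyTorch. The names (NormalizedGate, update_aux_coeff), the scale hyperparameter lam, and the drop-rate-based update rule are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (not the authors' implementation) of a top-k MoE gate with
# gating logit normalization, plus a layer-wise adaptive auxiliary loss
# coefficient update. Names, defaults, and the update rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormalizedGate(nn.Module):
    """Top-k router that standardizes its logits before the softmax."""

    def __init__(self, hidden_dim: int, num_experts: int = 16,
                 top_k: int = 2, lam: float = 1.0):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        self.lam = lam  # assumed scale applied after normalization

    def forward(self, x: torch.Tensor):
        logits = self.proj(x)                                  # (tokens, experts)
        # Gating logit normalization: zero-mean, unit-variance per token,
        # then rescale so the softmax stays sharp and different experts
        # win for different tokens (the claimed expert-diversification effect).
        mu = logits.mean(dim=-1, keepdim=True)
        sigma = logits.std(dim=-1, keepdim=True)
        normed = self.lam * (logits - mu) / (sigma + 1e-6)
        probs = F.softmax(normed, dim=-1)
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)
        return topk_probs, topk_idx, probs                     # probs feeds the aux loss


def update_aux_coeff(coeff: float, drop_rate: float,
                     target: float = 0.01, step: float = 1e-4) -> float:
    """Illustrative layer-specific adjustment: raise the auxiliary
    (load-balancing) loss coefficient when a layer drops more tokens than a
    target rate, lower it otherwise. The rule and constants are assumptions."""
    return max(0.0, coeff + step * (1.0 if drop_rate > target else -1.0))
```

Under this reading, each MoE layer keeps its own coefficient and updates it from that layer's observed token-drop statistics during training, which is one way to realize the "layer-specific adjustment of auxiliary loss coefficients" the abstract describes.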