Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

June 3, 2024
Authors: Tianwen Wei, Bo Zhu, Liang Zhao, Cheng Cheng, Biye Li, Weiwei Lü, Peng Cheng, Jianhao Zhang, Xiaoyu Zhang, Liang Zeng, Xiaokun Wang, Yutuan Ma, Rui Hu, Shuicheng Yan, Han Fang, Yahui Zhou
cs.AI

Abstract

In this technical report, we introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. It is initialized from the pre-existing dense checkpoints of our Skywork-13B model. We explore the comparative effectiveness of upcycling versus training from scratch initializations. Our findings suggest that the choice between these two approaches should consider both the performance of the existing dense checkpoints and the MoE training budget. We highlight two innovative techniques: gating logit normalization, which improves expert diversification, and adaptive auxiliary loss coefficients, allowing for layer-specific adjustment of auxiliary loss coefficients. Our experimental results validate the effectiveness of these methods. Leveraging these techniques and insights, we trained our upcycled Skywork-MoE on a condensed subset of our SkyPile corpus. The evaluation results demonstrate that our model delivers strong performance across a wide range of benchmarks.
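
To make the gating logit normalization concrete, the sketch below is a minimal, hypothetical PyTorch rendering rather than the authors' implementation: the function name, the `scale` factor, and the epsilon value are assumptions. It illustrates the idea stated in the abstract: standardizing the gating logits across the expert dimension and rescaling them before the softmax sharpens the gate distribution, which the report credits with improved expert diversification.

import torch
import torch.nn.functional as F

def normalized_gating(logits: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    # Hypothetical sketch of gating logit normalization: standardize the
    # per-token logits across the expert dimension, rescale by a tunable
    # factor, and only then apply the softmax that yields gate probabilities.
    mean = logits.mean(dim=-1, keepdim=True)
    std = logits.std(dim=-1, keepdim=True)
    return F.softmax(scale * (logits - mean) / (std + 1e-6), dim=-1)

# Example: gate probabilities for a batch of 4 tokens routed over 16 experts.
gate_probs = normalized_gating(torch.randn(4, 16), scale=2.0)

The adaptive auxiliary loss coefficients mentioned alongside it address a separate knob: instead of a single global weight on the load-balancing auxiliary loss, the report describes adjusting that coefficient on a per-layer basis during training.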
