Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models

Abstract

In this technical report, we introduce the training methodologies implemented in the development of Skywork-MoE, a high-performance mixture-of-experts (MoE) large language model (LLM) with 146 billion parameters and 16 experts. It is initialized from the pre-existing dense checkpoints of our Skywork-13B model. We compare the effectiveness of two initialization strategies: upcycling an existing dense model and training from scratch. Our findings suggest that the choice between these two approaches should consider both the performance of the existing dense checkpoints and the MoE training budget. We highlight two innovative techniques: gating logit normalization, which improves expert diversification, and adaptive auxiliary loss coefficients, which allow layer-specific adjustment of the auxiliary loss coefficients. Our experimental results validate the effectiveness of these methods. Leveraging these techniques and insights, we trained our upcycled Skywork-MoE on a condensed subset of our SkyPile corpus. The evaluation results demonstrate that our model delivers strong performance across a wide range of benchmarks.
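The abstract names gating logit normalization but does not spell out its form. As a rough illustration only, the sketch below assumes the technique standardizes each token's gating logits (zero mean, unit variance across experts) and rescales them before the softmax, which would sharpen an otherwise near-uniform gate distribution. The function names, the `scale` and `eps` parameters, and the example logits are all hypothetical and are not taken from the report.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_gate(logits, scale=1.0, eps=1e-6):
    """Standardize one token's gating logits, rescale, then apply softmax.

    Assumption: `scale` and `eps` are illustrative parameters, not values
    or names confirmed by the source.
    """
    n = len(logits)
    mean = sum(logits) / n
    var = sum((v - mean) ** 2 for v in logits) / n
    std = math.sqrt(var) + eps  # eps guards against division by zero
    rescaled = [scale * (v - mean) / std for v in logits]
    return softmax(rescaled)


# Made-up, nearly flat logits for four experts.
raw = [0.10, 0.12, 0.09, 0.11]
plain = softmax(raw)                      # close to uniform
sharp = normalized_gate(raw, scale=2.0)   # more peaked, same ranking

print("plain gate:", [round(p, 3) for p in plain])
print("normalized gate:", [round(p, 3) for p in sharp])
```

Under this assumption, a larger `scale` yields a more peaked gate distribution while preserving the ranking of experts, which is one way a router could be pushed toward more distinct expert assignments.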
