Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

Abstract

The Mixture of Experts (MoE) architecture significantly reduces training and inference costs compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, training progresses more slowly than when training from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling, a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves performance comparable to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints, and logs, are publicly available to promote reproducibility and future research on MoE.
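
The core idea in the abstract, copying pre-trained dense weights into every expert and then statistically re-initializing part of each copy, can be illustrated with a small sketch. The abstract does not say which parts of the weights are re-initialized, in what proportion, or from which distribution, so everything specific here is an assumption made for illustration: the function name drop_upcycle, the reinit_ratio parameter, the choice to re-initialize whole columns, and sampling from a normal distribution fitted to the original weights. It is not the authors' implementation.

```python
import random
import statistics


def drop_upcycle(dense_weight, num_experts, reinit_ratio, seed=0):
    """Build `num_experts` expert matrices from one dense weight matrix.

    Each expert starts as a copy of `dense_weight` (plain upcycling). Then a
    randomly chosen fraction `reinit_ratio` of its columns is overwritten with
    samples from N(mean, std), where mean and std are taken from the dense
    weights. All of these choices are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Statistics of the pre-trained weights, used for re-initialization.
    flat = [w for row in dense_weight for w in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)

    num_cols = len(dense_weight[0])
    num_reinit = round(reinit_ratio * num_cols)

    experts = []
    for _ in range(num_experts):
        # Copy the dense weights so the original matrix is never mutated.
        expert = [row[:] for row in dense_weight]
        # Each expert draws its own column subset, so the copies diverge.
        cols = rng.sample(range(num_cols), num_reinit)
        for row in expert:
            for c in cols:
                row[c] = rng.gauss(mean, std)
        experts.append(expert)
    return experts
```

With reinit_ratio set to 0.0 the sketch reduces to plain upcycling, where every expert is an identical copy of the dense matrix. With reinit_ratio set to 1.0 no pre-trained values survive, which is close to training the experts from scratch. Intermediate values keep part of the dense model's knowledge while giving each expert a different starting point, which is the trade-off the abstract describes.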
