GRIN: GRadient-INformed MoE

Abstract

Mixture-of-Experts (MoE) models scale more effectively than dense models because expert routing makes computation sparse: only a small subset of expert modules is activated for each input. However, sparse computation challenges traditional training practices, because discrete expert routing hinders standard backpropagation and therefore gradient-based optimization, the cornerstone of deep learning. To better pursue the scaling power of MoE, we introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing and configures model parallelism to avoid token dropping. Applying GRIN to autoregressive language modeling, we develop a top-2 16×3.8B MoE model. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data. Extensive evaluations across diverse tasks demonstrate the potential of GRIN to significantly enhance MoE efficacy, achieving 79.4 on MMLU, 83.7 on HellaSwag, 74.4 on HumanEval, and 58.9 on MATH.
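The top-2 expert routing mentioned above, and the reason its discrete selection step gets in the way of backpropagation, can be illustrated with a minimal pure-Python sketch. This is not the GRIN implementation and does not include its gradient estimator. The function names (`top2_route`, `moe_forward`), the toy scalar "experts", and the choice to renormalize the gate weights over the two selected experts are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(logits):
    """Return (indices, gates) for the two highest-scoring experts.

    Assumption: gates are softmax probabilities renormalized over the
    two selected experts; other gating conventions exist.
    """
    probs = softmax(logits)
    # Discrete step: the chosen indices are piecewise constant in the logits,
    # so plain backpropagation gets no gradient through the selection itself.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    norm = sum(probs[i] for i in chosen)
    gates = [probs[i] / norm for i in chosen]
    return chosen, gates

def moe_forward(x, logits, experts):
    """Sparse forward pass: only the two selected experts are evaluated."""
    chosen, gates = top2_route(logits)
    return sum(g * experts[i](x) for i, g in zip(chosen, gates))

# 16 hypothetical toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=i + 1: s * x for i in range(16)]

# Hypothetical router scores for one token.
logits = [0.0] * 16
logits[3] = 2.0
logits[7] = 1.0

chosen, gates = top2_route(logits)
output = moe_forward(1.0, logits, experts)

# Nudging the score of an unselected expert does not change the selection.
nudged_logits = [l + 0.01 * (i == 5) for i, l in enumerate(logits)]
nudged, _ = top2_route(nudged_logits)
```

In this sketch the gate weights still vary smoothly with the router scores, but the choice of which two experts run does not. Small changes to the scores usually leave the selected indices unchanged, so the selection step by itself provides no gradient signal. The abstract describes sparse gradient estimation for expert routing as GRIN's way of addressing this problem.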