A Closer Look into Mixture-of-Experts in Large Language Models

Abstract

Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially on language tasks. By sparsely activating a subset of parameters for each token, the MoE architecture can increase model size without sacrificing computational efficiency, achieving a better trade-off between performance and training cost. However, the underlying mechanism of MoE remains underexplored, and its degree of modularization is still in question. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations: (1) neurons act like fine-grained experts; (2) the MoE router usually selects experts with larger output norms; and (3) expert diversity increases with layer depth, with the last layer being an outlier. Based on these observations, we also provide suggestions for a broad spectrum of MoE practitioners, covering topics such as router design and expert allocation. We hope this work sheds light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
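
To make "sparsely activating a subset of parameters for each token" concrete, the following is a minimal pure-Python sketch of top-k routing. It is an illustration under stated assumptions, not the paper's code: the function names (`top_k_route`, `moe_forward`), the toy scaling experts, and the gate scores are all hypothetical, and normalizing with a softmax over only the selected logits is one common convention (some models instead apply the softmax over all experts before selecting).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Assumption: weights are renormalized over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(token, experts, router_logits, k=2):
    """Weighted sum of the outputs of only the k selected experts."""
    out = [0.0] * len(token)
    for idx, w in top_k_route(router_logits, k):
        y = experts[idx](token)  # the other experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Hypothetical experts: each scales the token by a fixed factor,
# standing in for the feed-forward networks of a real MoE layer.
experts = [lambda x, s=s: [s * v for v in x] for s in (0.5, 1.0, 2.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 0.3, 2.0, 1.5]  # made-up gate scores for this token

routes = top_k_route(router_logits, k=2)
output = moe_forward(token, experts, router_logits, k=2)
```

Here only two of the four experts run for the token, so compute per token stays roughly constant as more experts (and parameters) are added.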
