A Closer Look into Mixture-of-Experts in Large Language Models
June 26, 2024
Authors: Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu
cs.AI
Abstract
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially on language tasks. By sparsely activating a subset of parameters for each token, the MoE architecture can increase model size without sacrificing computational efficiency, achieving a better trade-off between performance and training cost. However, the underlying mechanism of MoE still lacks further exploration, and its degree of modularization remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including: (1) neurons act like fine-grained experts; (2) the router of MoE usually selects experts with larger output norms; (3) expert diversity increases with layer depth, while the last layer is an outlier. Based on these observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as on router design and expert allocation. We hope this work can shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
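Observation (2), that the router tends to pick experts whose outputs have larger norms, is easy to probe empirically. Below is a minimal, self-contained PyTorch sketch of a top-k routed MoE layer that also records every expert's output norm, so the router's top-1 choice can be compared against the largest-norm expert per token. This is not the authors' released code; the class name `TinyMoELayer` and all dimensions are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's implementation) of a top-k
# MoE layer that exposes per-expert output norms for analysis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (tokens, d_model)
        gate_logits = self.router(x)            # (tokens, n_experts)
        weights, chosen = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        # Evaluate every expert densely (fine for analysis, not for speed),
        # so we can measure each expert's output norm per token.
        all_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, n_experts, d_model)
        norms = all_out.norm(dim=-1)                                # (tokens, n_experts)

        # Combine only the selected experts' outputs, weighted by the gate.
        selected = all_out.gather(
            1, chosen.unsqueeze(-1).expand(-1, -1, all_out.size(-1)))
        out = torch.einsum("tk,tkd->td", weights, selected)
        return out, chosen, norms


# Quick check: how often does the router's top-1 choice coincide with the
# expert whose output norm is largest for that token?
layer = TinyMoELayer()
tokens = torch.randn(512, 64)
_, chosen, norms = layer(tokens)
agreement = (chosen[:, 0] == norms.argmax(dim=-1)).float().mean().item()
print(f"top-1 routing matches largest-norm expert for {agreement:.1%} of tokens")
```

With randomly initialized weights this check is only a sanity test of the measurement itself; the paper's observation concerns trained MoE models, where routing and output norms are measured on real checkpoints.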