A Closer Look into Mixture-of-Experts in Large Language Models
June 26, 2024
Authors: Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu
cs.AI
Abstract
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially on language tasks. By sparsely activating a subset of parameters for each token, the MoE architecture can increase model size without sacrificing computational efficiency, achieving a better trade-off between performance and training cost. However, the underlying mechanism of MoE still lacks thorough exploration, and its degree of modularization remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including: (1) neurons act like fine-grained experts; (2) the router of MoE usually selects experts with larger output norms; (3) expert diversity increases with layer depth, while the last layer is an outlier. Based on these observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work can shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
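To make the sparse-activation mechanism described in the abstract concrete, below is a minimal sketch of a top-k routed MoE feed-forward layer in PyTorch. This is not the authors' code: the dimensions, the SiLU experts, and the top-2 routing are illustrative assumptions, and the `mean_norm` bookkeeping simply exposes the per-expert output norms that observation (2) concerns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k routed MoE feed-forward layer (illustrative sketch, not the paper's code)."""
    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model) -- a flat batch of token representations
        gate_logits = self.router(x)                              # (num_tokens, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)   # each token picks its top-k experts
        weights = F.softmax(weights, dim=-1)                      # renormalize over the chosen experts
        out = torch.zeros_like(x)
        mean_norm = torch.zeros(gate_logits.size(-1))             # mean output norm per expert
        load = torch.zeros(gate_logits.size(-1))                  # number of tokens routed to each expert
        for e, expert in enumerate(self.experts):
            rows, slots = (indices == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if rows.numel() == 0:
                continue
            y = expert(x[rows])                                   # run the expert only on its tokens
            out[rows] += weights[rows, slots].unsqueeze(-1) * y   # weighted combination of expert outputs
            mean_norm[e] = y.detach().norm(dim=-1).mean()         # quantity relevant to observation (2)
            load[e] = rows.numel()
        return out, mean_norm, load


if __name__ == "__main__":
    tokens = torch.randn(16, 512)
    moe = TopKMoE()
    y, mean_norm, load = moe(tokens)
    print(load, mean_norm)  # routing load vs. per-expert output norm
```

Running the snippet prints each expert's token load next to its mean output norm; with random weights no pattern is expected, but in a trained MoE model the same two quantities can be compared to check whether heavily routed experts indeed tend to produce larger output norms, as the paper observes.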