A Closer Look into Mixture-of-Experts in Large Language Models
June 26, 2024
Authors: Ka Man Lo, Zeyu Huang, Zihan Qiu, Zili Wang, Jie Fu
cs.AI
Abstract
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially on language tasks. By sparsely activating a subset of parameters for each token, the MoE architecture can increase model size without sacrificing computational efficiency, achieving a better trade-off between performance and training cost. However, the underlying mechanism of MoE still lacks thorough exploration, and its degree of modularization remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including: (1) neurons act like fine-grained experts; (2) the router of MoE usually selects experts with larger output norms; (3) expert diversity increases with layer depth, while the last layer is an outlier. Based on these observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work can shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
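To make the sparse-activation mechanism described in the abstract concrete, below is a minimal sketch of a top-k routed MoE feed-forward layer in PyTorch. This is not the authors' code: the dimensions, the SiLU experts, and the top-2 routing are illustrative assumptions, and the `mean_norm` bookkeeping simply exposes the per-expert output norms that observation (2) concerns.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal top-k routed MoE feed-forward layer (illustrative sketch, not the paper's code)."""
    def __init__(self, d_model=512, d_ff=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model) -- a flat batch of token representations
        gate_logits = self.router(x)                              # (num_tokens, num_experts)
        weights, indices = gate_logits.topk(self.top_k, dim=-1)   # each token picks its top-k experts
        weights = F.softmax(weights, dim=-1)                      # renormalize over the chosen experts
        out = torch.zeros_like(x)
        mean_norm = torch.zeros(gate_logits.size(-1))             # mean output norm per expert
        load = torch.zeros(gate_logits.size(-1))                  # number of tokens routed to each expert
        for e, expert in enumerate(self.experts):
            rows, slots = (indices == e).nonzero(as_tuple=True)   # tokens routed to expert e
            if rows.numel() == 0:
                continue
            y = expert(x[rows])                                   # run the expert only on its tokens
            out[rows] += weights[rows, slots].unsqueeze(-1) * y   # weighted combination of expert outputs
            mean_norm[e] = y.detach().norm(dim=-1).mean()         # quantity relevant to observation (2)
            load[e] = rows.numel()
        return out, mean_norm, load


if __name__ == "__main__":
    tokens = torch.randn(16, 512)
    moe = TopKMoE()
    y, mean_norm, load = moe(tokens)
    print(load, mean_norm)  # routing load vs. per-expert output norm
```

Running the snippet prints each expert's token load next to its mean output norm; with random weights no pattern is expected, but in a trained MoE model the same two quantities can be compared to check whether heavily routed experts indeed tend to produce larger output norms, as the paper observes.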