DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
January 11, 2024
Authors: Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang
cs.AI
Abstract
In the era of large language models, Mixture-of-Experts (MoE) is a promising
architecture for managing computational costs when scaling up model parameters.
However, conventional MoE architectures like GShard, which activate the top-K
out of N experts, face challenges in ensuring expert specialization, i.e.,
that each expert acquires non-overlapping and focused knowledge. In response, we
propose the DeepSeekMoE architecture towards ultimate expert specialization. It
involves two principal strategies: (1) finely segmenting the experts into mN
ones and activating mK from them, allowing for a more flexible combination of
activated experts; (2) isolating K_s experts as shared ones, aiming at
capturing common knowledge and mitigating redundancy in routed experts.
Starting from a modest scale with 2B parameters, we demonstrate that
DeepSeekMoE 2B achieves performance comparable to that of GShard 2.9B, which has
1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B
nearly approaches the performance of its dense counterpart with the same number
of total parameters, which sets the upper bound for MoE models. Subsequently, we
scale up DeepSeekMoE to 16B parameters and show that it achieves performance
comparable to that of LLaMA2 7B with only about 40% of the computation. Further, our
preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently
validate its substantial advantages over the GShard architecture and show that its
performance is comparable to that of DeepSeek 67B while using only 28.5% (possibly
even 18.2%) of the computation.
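
To make the two strategies above concrete, the sketch below shows, in PyTorch, how a DeepSeekMoE-style layer could combine finely segmented routed experts (top-K gating over many small experts) with isolated shared experts that every token passes through. This is a minimal illustration based only on the description in the abstract, not the authors' implementation: the class and argument names (DeepSeekMoELayer, ExpertFFN, n_routed, n_shared, top_k, d_hidden) are hypothetical, and training-time details such as load-balancing losses are omitted. The flexibility gained by fine-grained segmentation is combinatorial: for example, splitting 16 experts under top-2 routing into 64 quarter-sized experts under top-8 routing keeps the activated parameter count fixed while raising the number of possible expert combinations from C(16,2) = 120 to C(64,8) ≈ 4.4 billion.

```python
# Minimal sketch of a DeepSeekMoE-style layer (illustrative, not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """One small feed-forward expert."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))


class DeepSeekMoELayer(nn.Module):
    """Fine-grained routed experts plus isolated shared experts (sketch)."""

    def __init__(self, d_model: int, d_hidden: int,
                 n_routed: int, n_shared: int, top_k: int):
        super().__init__()
        # Fine-grained segmentation: many small routed experts (the "mN" experts).
        self.routed = nn.ModuleList([ExpertFFN(d_model, d_hidden) for _ in range(n_routed)])
        # Shared-expert isolation: K_s experts that process every token, bypassing the router.
        self.shared = nn.ModuleList([ExpertFFN(d_model, d_hidden) for _ in range(n_shared)])
        self.gate = nn.Linear(d_model, n_routed, bias=False)
        self.top_k = top_k  # "mK" routed experts activated per token

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)             # routing probabilities
        top_w, top_idx = scores.topk(self.top_k, dim=-1)  # per-token expert choices
        out = x.clone()                                   # residual connection
        for expert in self.shared:                        # shared experts: always active
            out = out + expert(x)
        for t in range(x.size(0)):                        # naive per-token dispatch
            for w, idx in zip(top_w[t], top_idx[t]):
                out[t] = out[t] + w * self.routed[int(idx)](x[t])
        return out


# Usage example with arbitrary sizes:
layer = DeepSeekMoELayer(d_model=64, d_hidden=128, n_routed=16, n_shared=2, top_k=4)
tokens = torch.randn(8, 64)
print(layer(tokens).shape)  # torch.Size([8, 64])
```

The per-token Python loop is written for readability only; a practical implementation would group tokens by their selected experts and dispatch them to the expert FFNs in batches.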