DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
January 11, 2024
Authors: Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang
cs.AI
Abstract
In the era of large language models, Mixture-of-Experts (MoE) is a promising
architecture for managing computational costs when scaling up model parameters.
However, conventional MoE architectures like GShard, which activate the top-K
out of N experts, face challenges in ensuring expert specialization, i.e.,
that each expert acquires non-overlapping and focused knowledge. In response, we
propose the DeepSeekMoE architecture towards ultimate expert specialization. It
involves two principal strategies: (1) finely segmenting the experts into mN
ones and activating mK from them, allowing for a more flexible combination of
activated experts; (2) isolating K_s experts as shared ones, aiming at
capturing common knowledge and mitigating redundancy in routed experts.
Starting from a modest scale with 2B parameters, we demonstrate that
DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5
times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly
approaches the performance of its dense counterpart with the same number of
total parameters, which sets the upper bound for MoE models. Subsequently, we
scale up DeepSeekMoE to 16B parameters and show that it achieves comparable
performance with LLaMA2 7B, with only about 40% of computations. Further, our
preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently
validate its substantial advantages over the GShard architecture, and show its
performance comparable with DeepSeek 67B, using only 28.5% (possibly as
little as 18.2%) of computations.
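
The two strategies above can be illustrated with a minimal sketch of a single-token forward pass: fine-grained routed experts of which only the top-mK fire, plus K_s shared experts that are always active. All function and variable names here are hypothetical, and the gating details (softmax affinities, no renormalization after top-k) are a simplified assumption, not the authors' exact implementation:

```python
import numpy as np

def deepseekmoe_layer(x, shared_experts, routed_experts, gate_W, top_k):
    """Toy DeepSeekMoE-style layer for one token vector x (hypothetical sketch).

    shared_experts: list of K_s expert functions, always applied to x.
    routed_experts: list of mN fine-grained expert functions.
    gate_W:         (d, mN) token-to-expert affinity matrix.
    top_k:          mK, the number of routed experts activated per token.
    """
    # Softmax over token-expert affinities gives routed-expert gate values.
    logits = x @ gate_W
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Keep only the top-k routed experts; the rest contribute nothing.
    topk_idx = np.argsort(probs)[-top_k:]

    out = np.zeros_like(x)
    # Shared experts: always active, intended to capture common knowledge.
    for expert in shared_experts:
        out += expert(x)
    # Routed experts: only the selected top-k fire, weighted by their gates.
    for i in topk_idx:
        out += probs[i] * routed_experts[i](x)
    return out
```

With m-fold segmentation, the same activated-parameter budget admits many more top-k combinations (choosing mK of mN rather than K of N experts), which is the flexibility the first strategy targets; the shared experts in the loop above realize the second.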