DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

January 11, 2024
Authors: Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang
cs.AI

Abstract

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., that each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts; (2) isolating K_s experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves comparable performance with GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which sets the upper bound of MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves comparable performance with LLaMA2 7B, with only about 40% of computations. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show its performance comparable with DeepSeek 67B, using only 28.5% (maybe even 18.2%) of computations.
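To make the two strategies in the abstract concrete, below is a minimal PyTorch sketch of a DeepSeekMoE-style layer: fine-grained expert segmentation (mN small routed experts, with mK selected per token) combined with shared expert isolation (K_s experts that every token always passes through). The class and parameter names, the dimensions, and the softmax-over-affinities router are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
# Minimal sketch of a DeepSeekMoE-style layer (assumptions: names, dims, and
# the router design are illustrative; only the two strategies from the
# abstract are modeled here).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """A small FFN expert; hidden width is shrunk by the segmentation factor m."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class DeepSeekMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ffn=2048, n_experts=16, top_k=2,
                 m=4, n_shared=2):
        super().__init__()
        # Fine-grained segmentation: m*N routed experts, each 1/m the FFN
        # width, with m*K of them activated per token (roughly the same
        # compute as activating top-K of N full-width experts).
        self.n_routed = m * n_experts
        self.top_k = m * top_k
        self.routed = nn.ModuleList(
            FeedForward(d_model, d_ffn // m) for _ in range(self.n_routed)
        )
        # Shared expert isolation: K_s experts applied to every token,
        # intended to capture common knowledge.
        self.shared = nn.ModuleList(
            FeedForward(d_model, d_ffn // m) for _ in range(n_shared)
        )
        self.router = nn.Linear(d_model, self.n_routed, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.size(-1))
        scores = F.softmax(self.router(tokens), dim=-1)
        gate, idx = scores.topk(self.top_k, dim=-1)  # (tokens, top_k)

        out = torch.zeros_like(tokens)
        # Shared experts: always active, no gating.
        for expert in self.shared:
            out = out + expert(tokens)
        # Routed experts: each expert processes only its assigned tokens,
        # weighted by the corresponding gate value.
        for e in range(self.n_routed):
            token_ids, slot = (idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += gate[token_ids, slot].unsqueeze(-1) * \
                self.routed[e](tokens[token_ids])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = DeepSeekMoELayer()
    y = layer(torch.randn(2, 8, 512))
    print(y.shape)  # torch.Size([2, 8, 512])
```

The design intent, as the abstract describes it, is that splitting experts into finer units widens the space of expert combinations a token can activate, while the always-on shared experts absorb common knowledge so the routed experts are freer to specialize.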