

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

January 11, 2024
Authors: Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, Wenfeng Liang
cs.AI

Abstract

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., that each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into mN smaller ones and activating mK of them, allowing for a more flexible combination of activated experts; (2) isolating K_s experts as shared ones that capture common knowledge and mitigate redundancy among the routed experts. Starting from a modest scale of 2B parameters, we demonstrate that DeepSeekMoE 2B achieves performance comparable to GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which sets the upper bound for MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves performance comparable to LLaMA2 7B with only about 40% of the computation. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show performance comparable to DeepSeek 67B using only 28.5% (possibly as little as 18.2%) of the computation.
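To make the two strategies in the abstract concrete, below is a minimal PyTorch sketch of an MoE layer that combines fine-grained expert segmentation with shared-expert isolation under standard top-K softmax routing. It is an illustration, not the authors' implementation: class and argument names (FineGrainedMoELayer, n_routed_experts, n_shared_experts, top_k) are invented for this example, and training details such as load-balancing losses are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeedForward(nn.Module):
    """One small FFN expert; its hidden size shrinks as experts are segmented finer."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_hidden, bias=False)
        self.w_out = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(F.silu(self.w_in(x)))


class FineGrainedMoELayer(nn.Module):
    """Sketch of an MoE layer with shared-expert isolation and top-k routing."""

    def __init__(self, d_model: int, d_expert_hidden: int,
                 n_routed_experts: int, n_shared_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k  # routed experts activated per token
        self.shared_experts = nn.ModuleList(
            [FeedForward(d_model, d_expert_hidden) for _ in range(n_shared_experts)])
        self.routed_experts = nn.ModuleList(
            [FeedForward(d_model, d_expert_hidden) for _ in range(n_routed_experts)])
        # Router produces token-to-expert affinity scores over routed experts only.
        self.router = nn.Linear(d_model, n_routed_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        # Shared experts are always active and absorb common knowledge.
        shared_out = sum(expert(x) for expert in self.shared_experts)
        # Routed experts: softmax affinities, keep only the top-k per token.
        scores = F.softmax(self.router(x), dim=-1)            # (num_tokens, n_routed)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                           # chosen expert per token
            gate = topk_scores[:, slot].unsqueeze(-1)         # its gating weight
            for e in idx.unique().tolist():
                mask = idx == e
                routed_out[mask] += gate[mask] * self.routed_experts[e](x[mask])
        return x + shared_out + routed_out                    # residual connection


# Toy usage: 64 fine-grained routed experts, 2 shared experts, 6 routed per token.
layer = FineGrainedMoELayer(d_model=512, d_expert_hidden=128,
                            n_routed_experts=64, n_shared_experts=2, top_k=6)
tokens = torch.randn(8, 512)
print(layer(tokens).shape)  # torch.Size([8, 512])
```

Splitting each conventional expert into m smaller ones while activating m times as many per token keeps the per-token computation roughly constant but greatly increases the number of possible expert combinations, which is the routing flexibility the abstract refers to; the always-active shared experts are meant to soak up common knowledge so the routed experts can specialize.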