DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

Abstract

In the era of large language models, Mixture-of-Experts (MoE) is a promising architecture for managing computational costs when scaling up model parameters. However, conventional MoE architectures like GShard, which activate the top-K out of N experts, face challenges in ensuring expert specialization, i.e., that each expert acquires non-overlapping and focused knowledge. In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into mN ones and activating mK of them, allowing for a more flexible combination of activated experts; (2) isolating K_s experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts. Starting from a modest scale with 2B parameters, we demonstrate that DeepSeekMoE 2B achieves performance comparable to GShard 2.9B, which has 1.5 times the expert parameters and computation. In addition, DeepSeekMoE 2B nearly approaches the performance of its dense counterpart with the same number of total parameters, which sets the upper bound for MoE models. Subsequently, we scale up DeepSeekMoE to 16B parameters and show that it achieves performance comparable to LLaMA2 7B while using only about 40% of the computation. Further, our preliminary efforts to scale up DeepSeekMoE to 145B parameters consistently validate its substantial advantages over the GShard architecture, and show performance comparable to DeepSeek 67B while using only 28.5% (possibly as little as 18.2%) of the computation.
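The two strategies in the abstract can be illustrated with a minimal sketch. The first function counts how many distinct sets of activated experts are possible when N experts are split into mN finer ones and mK of them are activated. The second shows one possible way to combine always-active shared experts with top-k routed experts. The function names, the convention that shared experts occupy the first indices, and the example values are assumptions made for illustration; this is not the authors' implementation.

```python
from math import comb


def expert_combinations(n_experts, k_active, m=1):
    """Count the distinct sets of activated experts when each of the N
    experts is split into m finer ones and m*K of the m*N are activated.
    With m=1 this is the conventional top-K-of-N setting."""
    return comb(m * n_experts, m * k_active)


def select_experts(scores, n_shared, n_routed_active):
    """Return the indices of the activated experts for one token.

    Assumption for illustration: the first n_shared experts are the shared
    ones and are always active; the remaining experts are routed and are
    chosen by top-k over their scores.
    """
    shared = list(range(n_shared))
    routed = range(n_shared, len(scores))
    top = sorted(routed, key=lambda i: scores[i], reverse=True)[:n_routed_active]
    return shared + sorted(top)


if __name__ == "__main__":
    # Conventional routing: top-2 out of 16 experts.
    print(expert_combinations(16, 2))
    # Fine-grained routing with m=4: 8 out of 64 smaller experts.
    print(expert_combinations(16, 2, m=4))
    # One shared expert (index 0) plus the two best-scoring routed experts.
    print(select_experts([0.1, 0.9, 0.3, 0.8, 0.2], n_shared=1, n_routed_active=2))
```

With N = 16 and K = 2, splitting each expert into m = 4 parts raises the number of possible expert combinations from 120 to 4,426,165,368, which is the sense in which finer segmentation makes the combination of activated experts more flexible.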
