CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition

Abstract

Sparse mixture of experts (SMoE) offers an appealing way to scale up model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of a suboptimal routing process in which the experts that perform computation do not directly contribute to routing. In this work, we propose competition, a novel mechanism that routes tokens to the experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm for training large language models that deploys a router to learn the competition policy, thus achieving strong performance at a low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
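The competition idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that). It assumes that "neural response" is measured as the L2 norm of each expert's output and that the winners' responses are turned into mixing weights with a softmax. The abstract specifies neither choice. The function name `competition_route` and the single-layer tanh experts are hypothetical stand-ins.

```python
import numpy as np

def competition_route(x, expert_weights, k=1):
    """Route one token to the k experts that respond most strongly.

    x: token vector of shape (d,).
    expert_weights: list of (d, d) matrices; each is a hypothetical
    single-layer stand-in for a full expert network.
    """
    # Every expert processes the token, so the experts themselves
    # take part in the routing decision.
    outputs = np.stack([np.tanh(W @ x) for W in expert_weights])  # (n, d)

    # ASSUMPTION: the "neural response" is the L2 norm of each output.
    responses = np.linalg.norm(outputs, axis=1)  # (n,)

    # The k experts with the largest response win the competition.
    winners = np.argsort(-responses)[:k]

    # ASSUMPTION: winners' responses become mixing weights via softmax.
    scores = responses[winners]
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()

    # Combine the winning experts' outputs.
    y = (gates[:, None] * outputs[winners]).sum(axis=0)
    return winners, gates, y


# Example: three toy experts that differ only in scale.
experts = [0.1 * np.eye(4), 2.0 * np.eye(4), 0.5 * np.eye(4)]
winners, gates, y = competition_route(np.ones(4), experts, k=2)
```

In this sketch every expert runs on every token, which is more expensive than conventional sparse routing. The abstract's router that learns the competition policy is not shown here.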
