MH-MoE: Multi-Head Mixture-of-Experts

Abstract

Multi-Head Mixture-of-Experts (MH-MoE) demonstrates superior performance by using the multi-head mechanism to collectively attend to information from various representation spaces within different experts. In this paper, we present a novel implementation of MH-MoE that maintains both FLOPs and parameter parity with sparse Mixture of Experts models. Experimental results on language models show that the new implementation yields quality improvements over both vanilla MoE and fine-grained MoE models. Additionally, our experiments demonstrate that MH-MoE is compatible with 1-bit Large Language Models (LLMs) such as BitNet.
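The abstract does not spell out the mechanism, so the following is only a minimal sketch of one way to read "multi-head" routing: a token is projected, split into several sub-tokens (heads), each sub-token is routed to its own expert, and the expert outputs are concatenated and projected back. The class name `MultiHeadMoESketch`, the head and merge projections, the softmax router, the two-layer ReLU experts, and every dimension below are illustrative assumptions, not the authors' implementation, and the sketch makes no attempt at FLOPs or parameter parity.

```python
import math
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def rand_matrix(rows, cols, rng):
    """Small random matrix; the initialization scale is an arbitrary choice."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MultiHeadMoESketch:
    """Hypothetical multi-head MoE layer for a single token (assumed design)."""

    def __init__(self, d_model=8, n_heads=2, n_experts=4, d_hidden=16, top_k=1, seed=0):
        assert d_model % n_heads == 0, "d_model must split evenly across heads"
        rng = random.Random(seed)
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.top_k = top_k
        # Assumed projections applied before splitting and after merging.
        self.head_proj = rand_matrix(d_model, d_model, rng)
        self.merge_proj = rand_matrix(d_model, d_model, rng)
        # The router scores each sub-token against every expert.
        self.router = rand_matrix(n_experts, self.d_head, rng)
        # Each expert is a two-layer feed-forward network on sub-tokens.
        self.experts = [
            (rand_matrix(d_hidden, self.d_head, rng), rand_matrix(self.d_head, d_hidden, rng))
            for _ in range(n_experts)
        ]

    def route(self, sub_token):
        """Return the top-k (expert index, gate weight) pairs for one sub-token."""
        gates = softmax(matvec(self.router, sub_token))
        ranked = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
        return [(i, gates[i]) for i in ranked[: self.top_k]]

    def expert_forward(self, index, sub_token):
        w_in, w_out = self.experts[index]
        hidden = [max(0.0, h) for h in matvec(w_in, sub_token)]  # ReLU
        return matvec(w_out, hidden)

    def forward(self, token):
        projected = matvec(self.head_proj, token)
        merged, chosen = [], []
        for h in range(self.n_heads):
            # Split the projected token into one sub-token per head.
            sub = projected[h * self.d_head:(h + 1) * self.d_head]
            out = [0.0] * self.d_head
            picks = self.route(sub)
            for index, gate in picks:
                expert_out = self.expert_forward(index, sub)
                out = [o + gate * v for o, v in zip(out, expert_out)]
            chosen.append([index for index, _ in picks])
            merged.extend(out)  # concatenate head outputs back to d_model
        return matvec(self.merge_proj, merged), chosen


layer = MultiHeadMoESketch()
output, chosen_experts = layer.forward([0.1 * i for i in range(8)])
print(len(output), chosen_experts)
```

Because each head is routed independently, a single token can draw on several experts even with `top_k=1`, which is the intuition behind attending to different representation spaces within different experts.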
