BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

Abstract

Mixture-of-Experts (MoE) architectures improve the efficiency of large language models by activating only a subset of experts per token. However, standard MoE uses a fixed Top-K routing strategy: every token is sent to the same number of experts regardless of how many it actually needs, which leads to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer severe performance degradation at high sparsity because of a train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. Using a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while preserving model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE-layer FLOPs by up to 85%, achieving up to 2.5× faster decoding and 1.4× higher throughput. These results demonstrate its effectiveness as a practical, plug-and-play solution for efficient MoE inference.
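The masking idea summarized above can be illustrated with a minimal sketch. The abstract does not specify how the masks are parameterized, so everything below is an assumption made for illustration: each Top-K expert gets a hypothetical mask logit, the mask probability is a sigmoid of that logit, the hard mask thresholds the probability at 0.5, and the auxiliary regularizer is taken to be the mean mask probability. Expert outputs are scalars to keep the example short, and the gradient is written by hand in pure Python rather than with an autograd library. All function and parameter names are hypothetical and are not taken from the BEAM implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def masked_moe_forward(router_weights, mask_logits, expert_outputs, threshold=0.5):
    """Forward pass for one token over its Top-K experts.

    Assumption: the binary mask is a thresholded sigmoid of a per-expert logit.
    Experts whose hard mask is 0 contribute nothing and could be skipped.
    """
    probs = [sigmoid(z) for z in mask_logits]
    hard = [1.0 if p > threshold else 0.0 for p in probs]
    y = sum(m * w * o for m, w, o in zip(hard, router_weights, expert_outputs))
    active = [i for i, m in enumerate(hard) if m == 1.0]
    return y, probs, hard, active


def ste_mask_logit_grads(grad_y, router_weights, expert_outputs, probs, reg_weight=0.0):
    """Gradient of the loss with respect to each mask logit.

    Straight-through estimator: the thresholding step is treated as the
    identity in the backward pass, so d(hard)/d(logit) is approximated by
    d(prob)/d(logit) = p * (1 - p). The assumed regularizer is
    reg_weight * mean(probs), which pushes mask probabilities down.
    """
    k = len(probs)
    grads = []
    for w, o, p in zip(router_weights, expert_outputs, probs):
        dsig = p * (1.0 - p)
        task_grad = grad_y * w * o * dsig
        reg_grad = (reg_weight / k) * dsig
        grads.append(task_grad + reg_grad)
    return grads


# Usage: three Top-K experts for a single token (illustrative values only).
router_weights = [0.5, 0.3, 0.2]
mask_logits = [2.0, -1.0, 0.5]
expert_outputs = [1.0, 2.0, 3.0]

y, probs, hard, active = masked_moe_forward(router_weights, mask_logits, expert_outputs)
grads = ste_mask_logit_grads(1.0, router_weights, expert_outputs, probs, reg_weight=0.1)
print("active experts:", active)
print("mask-logit gradients:", grads)
```

In this sketch the second expert is masked out in the forward pass, yet its mask logit still receives a nonzero gradient. That is the property a straight-through estimator provides: a hard, non-differentiable decision at inference time that remains trainable end to end.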