Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

Abstract

The Mixture of Experts (MoE) is an effective architecture for scaling large language models: by activating only a sparse subset of experts per token, it improves the trade-off between performance and efficiency. Under expert parallelism, however, MoE inference is inefficient because token-to-expert assignment is imbalanced, with some experts overloaded while others remain underutilized. This imbalance leads to poor resource utilization and higher latency, because the most heavily loaded expert dictates the overall delay, a phenomenon we define as the Straggler Effect. To mitigate it, we propose Capacity-Aware Inference, which comprises two key techniques: (1) Capacity-Aware Token Drop, which discards tokens assigned to overloaded experts in order to bound the maximum latency of the MoE layer, and (2) Capacity-Aware Token Reroute, which reallocates overflowed tokens to underutilized experts, balancing the token distribution. Together, these techniques improve the utilization of both high-load and low-load experts, yielding a more efficient MoE inference pipeline. Extensive experiments demonstrate the effectiveness of our methods and show significant improvements in inference efficiency, e.g., a 0.2% increase in average performance and a 1.94× inference speedup on Mixtral-8×7B-Instruct.
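To make the two techniques concrete, the following is a minimal pure-Python sketch, not the authors' implementation. The abstract does not specify how capacity is defined, which tokens are dropped, or how reroute targets are chosen, so everything here is an assumption made for illustration: the function names (`expert_capacity`, `route_top_k`, `token_drop`, `token_reroute`), the capacity rule (a `capacity_factor` multiple of the perfectly balanced load), dropping the lowest-scoring assignments first, and rerouting each overflowed token to its highest-scoring expert that still has spare capacity.

```python
import math
from collections import defaultdict


def expert_capacity(num_tokens, top_k, num_experts, capacity_factor):
    # Assumed rule: capacity_factor times the perfectly balanced per-expert load.
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def route_top_k(scores, top_k):
    # Plain top-k routing: each token picks its top_k highest-scoring experts.
    assignments = []  # (token, expert, score) triples
    for t, row in enumerate(scores):
        ranked = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        for e in ranked[:top_k]:
            assignments.append((t, e, row[e]))
    return assignments


def loads(assignments, num_experts):
    # Number of token assignments each expert must process.
    counts = [0] * num_experts
    for _, e, _ in assignments:
        counts[e] += 1
    return counts


def token_drop(assignments, num_experts, capacity):
    # Keep at most `capacity` assignments per expert.
    # Assumed criterion: the lowest-scoring assignments overflow first.
    per_expert = defaultdict(list)
    for a in assignments:
        per_expert[a[1]].append(a)
    kept, overflow = [], []
    for e in range(num_experts):
        ranked = sorted(per_expert[e], key=lambda a: a[2], reverse=True)
        kept.extend(ranked[:capacity])
        overflow.extend(ranked[capacity:])
    return kept, overflow


def token_reroute(kept, overflow, scores, num_experts, capacity):
    # Assumed policy: send each overflowed token to its best-scoring expert
    # that has spare capacity and does not already serve that token.
    counts = loads(kept, num_experts)
    used = defaultdict(set)
    for t, e, _ in kept:
        used[t].add(e)
    rerouted = list(kept)
    for t, old_e, _ in overflow:
        candidates = [e for e in range(num_experts)
                      if counts[e] < capacity and e != old_e and e not in used[t]]
        if not candidates:
            continue  # no spare capacity anywhere: the assignment stays dropped
        best = max(candidates, key=lambda e: scores[t][e])
        rerouted.append((t, best, scores[t][best]))
        counts[best] += 1
        used[t].add(best)
    return rerouted


# Toy example: 4 tokens, 4 experts, top-1 routing; three tokens prefer expert 0.
scores = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.2, 0.3, 0.4],
]
capacity = expert_capacity(num_tokens=4, top_k=1, num_experts=4, capacity_factor=1.0)
assignments = route_top_k(scores, top_k=1)
kept, overflow = token_drop(assignments, 4, capacity)
rerouted = token_reroute(kept, overflow, scores, 4, capacity)

print(loads(assignments, 4))  # [3, 0, 0, 1]: expert 0 is the straggler
print(loads(kept, 4))         # [1, 0, 0, 1]: the maximum load is capped by dropping
print(loads(rerouted, 4))     # [1, 1, 1, 1]: overflow fills the idle experts
```

In this toy case the layer's latency is governed by the largest entry of each load vector, so capping it at the capacity is what bounds the delay; rerouting then recovers the dropped assignments by using experts that would otherwise sit idle.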
