

Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

March 7, 2025
Authors: Shwai He, Weilin Cai, Jiayi Huang, Ang Li
cs.AI

Abstract

The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation, optimizing the trade-off between performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where some experts are overloaded while others remain underutilized. This imbalance leads to poor resource utilization and increased latency, as the most burdened expert dictates the overall delay, a phenomenon we define as the Straggler Effect. To mitigate this, we propose Capacity-Aware Inference, which includes two key techniques: (1) Capacity-Aware Token Drop, which discards overloaded tokens to regulate the maximum latency of MoE, and (2) Capacity-Aware Token Reroute, which reallocates overflowed tokens to underutilized experts, balancing the token distribution. These techniques collectively optimize both high-load and low-load expert utilization, leading to a more efficient MoE inference pipeline. Extensive experiments demonstrate the effectiveness of our methods, showing significant improvements in inference efficiency, e.g., a 0.2% average performance increase and a 1.94× inference speedup on Mixtral-8×7B-Instruct.
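To make the two techniques concrete, below is a minimal, hypothetical top-1 routing sketch in NumPy. The function name capacity_aware_route, the fixed per-expert capacity, and the drop/reroute switch are illustrative assumptions based on the abstract, not the paper's actual algorithm or code.

```python
# Hypothetical sketch of capacity-aware expert routing (illustrative only,
# not the authors' implementation). Each expert accepts at most `capacity`
# tokens; overflow tokens are either dropped (Token Drop) or sent to the
# best-scoring expert that still has room (Token Reroute).
import numpy as np

def capacity_aware_route(assignments, scores, num_experts, capacity, mode="reroute"):
    """assignments: (T,) top-1 expert id per token.
    scores: (T, num_experts) router scores, used to pick fallback experts.
    Returns a (T,) array of final expert ids; -1 means the token was dropped."""
    final = np.full(len(assignments), -1, dtype=int)
    load = np.zeros(num_experts, dtype=int)
    overflow = []

    # First pass: admit each token to its chosen expert until that expert is full.
    for t, e in enumerate(assignments):
        if load[e] < capacity:
            final[t] = e
            load[e] += 1
        else:
            overflow.append(t)

    if mode == "drop":
        return final  # Token Drop: overflow tokens skip the MoE layer entirely

    # Token Reroute: assign each overflow token to its highest-scoring expert with room.
    for t in overflow:
        for e in np.argsort(-scores[t]):
            if load[e] < capacity:
                final[t] = e
                load[e] += 1
                break
    return final

# Toy usage: 16 tokens, 4 experts, capacity 5 per expert.
rng = np.random.default_rng(0)
scores = rng.random((16, 4))
print(capacity_aware_route(scores.argmax(axis=1), scores, num_experts=4, capacity=5))
```

Under this kind of scheme, no expert processes more than the capacity's worth of tokens, which bounds the load of the most burdened expert and hence the straggler latency described in the abstract.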

