Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

Abstract

The Mixture of Experts (MoE) is an effective architecture for scaling large language models: by activating only a sparse subset of experts per token, it optimizes the trade-off between performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where some experts are overloaded while others remain underutilized. This imbalance leads to poor resource utilization and increased latency, because the most heavily loaded expert dictates the overall delay, a phenomenon we define as the \textit{Straggler Effect}. To mitigate this, we propose Capacity-Aware Inference, which comprises two key techniques: (1) \textit{Capacity-Aware Token Drop}, which discards tokens assigned to overloaded experts to regulate the maximum latency of MoE, and (2) \textit{Capacity-Aware Token Reroute}, which reallocates overflowed tokens to underutilized experts, balancing the token distribution. Together, these techniques optimize the utilization of both high-load and low-load experts, leading to a more efficient MoE inference pipeline. Extensive experiments demonstrate the effectiveness of our methods, showing significant improvements in inference efficiency, e.g., a 0.2\% average performance increase and a 1.94$\times$ inference speedup on Mixtral-8$\times$7B-Instruct.
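The two techniques named in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not specify how capacity is set, which tokens are dropped, or how reroute targets are chosen. The sketch assumes a capacity equal to a hypothetical `capacity_factor` times the perfectly balanced load, drops the lowest-scoring tokens at an overloaded expert, and reroutes each overflowed token to the highest-scoring expert that still has room. All function and parameter names are illustrative.

```python
import math


def expert_capacity(num_tokens, num_experts, top_k, capacity_factor):
    """Assumed capacity rule: capacity_factor times the perfectly balanced load."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)


def capacity_aware_route(scores, top_k, capacity, reroute=False):
    """Route tokens to experts under a per-expert capacity limit.

    scores[t][e] is the gating score of token t for expert e.
    Returns a dict mapping expert id -> list of token ids it will process.
    """
    num_experts = len(scores[0])

    # Ordinary top-k routing: each token requests its k best-scoring experts.
    requests = []
    for t, row in enumerate(scores):
        ranked = sorted(range(num_experts), key=lambda e: row[e], reverse=True)
        for e in ranked[:top_k]:
            requests.append((row[e], t, e))

    # Token Drop (assumed criterion): serve requests from highest to lowest
    # score, so a full expert turns away its lowest-scoring tokens.
    requests.sort(key=lambda r: r[0], reverse=True)
    assigned = {e: [] for e in range(num_experts)}
    overflow = []
    for _, t, e in requests:
        if len(assigned[e]) < capacity:
            assigned[e].append(t)
        else:
            overflow.append(t)

    # Token Reroute (assumed policy): send each overflowed token to the
    # best-scoring expert that still has spare capacity, if one exists.
    if reroute:
        for t in overflow:
            candidates = [
                e for e in range(num_experts)
                if len(assigned[e]) < capacity and t not in assigned[e]
            ]
            if candidates:
                best = max(candidates, key=lambda e: scores[t][e])
                assigned[best].append(t)
    return assigned


def straggler_load(assigned):
    """Latency proxy: the busiest expert determines the layer's delay."""
    return max(len(tokens) for tokens in assigned.values())


# Toy batch: 8 tokens, 4 experts, top-1 routing; six tokens prefer expert 0.
scores = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.55, 0.25, 0.15, 0.05],
    [0.50, 0.10, 0.30, 0.10],
    [0.45, 0.05, 0.20, 0.30],
    [0.40, 0.15, 0.10, 0.35],
    [0.10, 0.70, 0.10, 0.10],
    [0.10, 0.10, 0.70, 0.10],
]
cap = expert_capacity(num_tokens=8, num_experts=4, top_k=1, capacity_factor=1.0)

baseline = capacity_aware_route(scores, top_k=1, capacity=len(scores))
dropped = capacity_aware_route(scores, top_k=1, capacity=cap)
rerouted = capacity_aware_route(scores, top_k=1, capacity=cap, reroute=True)
```

In this toy batch the unconstrained router gives expert 0 six tokens, so the straggler load is 6. With a capacity of 2, Token Drop caps every expert at two tokens and discards four tokens, while Token Reroute places those four tokens on the idle experts so that every expert processes exactly two. The straggler load, used here as a stand-in for latency, falls from 6 to 2 in both cases; reroute additionally avoids losing tokens.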
