BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

Abstract

Mixture-of-Experts (MoE) architectures improve the efficiency of large language models by activating only a subset of experts per token. However, standard MoE employs a fixed Top-K routing strategy, which leads to redundant computation and suboptimal inference latency. Existing acceleration methods either require costly retraining with architectural changes or suffer severe performance degradation at high sparsity due to train-inference mismatch. To address these limitations, we propose BEAM (Binary Expert Activation Masking), a novel method that learns token-adaptive expert selection via trainable binary masks. Using a straight-through estimator and an auxiliary regularization loss, BEAM induces dynamic expert sparsity through end-to-end training while maintaining model capability. We further implement an efficient custom CUDA kernel for BEAM, ensuring seamless integration with the vLLM inference framework. Experiments show that BEAM retains over 98% of the original model's performance while reducing MoE layer FLOPs by up to 85%, achieving up to 2.5× faster decoding and 1.4× higher throughput, demonstrating its effectiveness as a practical, plug-and-play solution for efficient MoE inference.
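
To make the idea of a binary mask over Top-K routing concrete, the following is a minimal pure-Python sketch for a single token. It is an illustration under stated assumptions, not the BEAM implementation: the abstract does not specify how the mask is parameterized, so the per-expert `mask_logits`, the 0.5 threshold, the keep-top-1 fallback, the weight renormalization, and the mean-gate regularizer are all hypothetical choices made here for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def top_k(probs, k):
    """Indices of the k largest router probabilities, largest first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def masked_route(router_logits, mask_logits, k):
    """Top-K routing followed by a hard 0/1 mask over the selected experts.

    mask_logits is a hypothetical per-expert gate score for this token;
    the abstract does not say how BEAM computes it.
    Returns (active expert ids, renormalized weights, soft gate values).
    """
    probs = softmax(router_logits)
    chosen = top_k(probs, k)
    soft = [sigmoid(mask_logits[i]) for i in chosen]
    hard = [1 if s > 0.5 else 0 for s in soft]
    if sum(hard) == 0:
        hard[0] = 1  # assumption: always keep the top-1 expert
    active = [i for i, h in zip(chosen, hard) if h]
    total = sum(probs[i] for i in active)
    weights = [probs[i] / total for i in active]  # assumption: renormalize
    return active, weights, soft

def ste_grad(upstream_grad, soft):
    """Straight-through estimator for the thresholding step.

    The forward pass uses the hard mask; the backward pass treats the
    threshold as identity, so the gradient w.r.t. each mask logit is
    upstream * sigmoid'(logit) = upstream * s * (1 - s).
    """
    return [g * s * (1.0 - s) for g, s in zip(upstream_grad, soft)]

def sparsity_loss(soft):
    """Assumed regularizer: mean soft gate value, pushing gates toward 0."""
    return sum(soft) / len(soft)

# Toy token with 8 experts and Top-K = 4 (all values are made up).
router_logits = [2.0, 0.5, 1.5, -1.0, 0.0, 1.0, -0.5, 0.2]
mask_logits = [3.0, -2.0, -1.0, 0.0, 0.0, -3.0, 0.0, 0.0]
k = 4

active, weights, soft = masked_route(router_logits, mask_logits, k)
grads = ste_grad([1.0] * k, soft)
reg = sparsity_loss(soft)
active_fraction = len(active) / k  # share of Top-K experts actually run

print(active, weights, active_fraction)
```

In this toy case only one of the four routed experts stays active, so expert compute for the token falls to a quarter of the fixed Top-K cost, assuming cost scales linearly with the number of active experts. A different token with different gate scores would keep a different number of experts, which is what makes the selection token-adaptive.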