HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

Abstract

The Mixture of Experts (MoE) architecture has demonstrated significant advantages because it increases model capacity without a proportional increase in computation. However, large MoE models still impose substantial memory demands, which usually require expert offloading on resource-constrained platforms and incur significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation to reduce expert loading overhead, but it faces two major challenges. First, the expert activation patterns of MoE models are highly unstable, which makes the fixed mapping strategies of existing work inefficient. Second, hybrid CPU-GPU scheduling for MoE is inherently complex because of diverse expert sizes and structures, uneven workload distribution, and similar factors. To address these challenges, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to a state-of-the-art hybrid MoE inference framework. Our code is available at https://github.com/PKU-SEC-Lab/HybriMoE.
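
The abstract names a score-based caching algorithm but does not say how scores are computed or how eviction works. To make the general idea concrete, here is a minimal sketch under stated assumptions; it is not the HybriMoE implementation. The class name ScoreCache, the exponential decay factor, and the admission rule are all hypothetical choices made for illustration. The sketch keeps a running score per expert, fed by router scores, and evicts the lowest-scoring resident expert only when the newcomer scores higher.

```python
from dataclasses import dataclass, field


@dataclass
class ScoreCache:
    """Toy expert cache that evicts the resident expert with the lowest score.

    Illustrative only: the scoring and admission rules are assumptions,
    not the algorithm described in the paper.
    """

    capacity: int                                # experts that fit on the GPU (assumed)
    decay: float = 0.5                           # hypothetical aging factor per step
    scores: dict = field(default_factory=dict)   # expert id -> running score
    resident: set = field(default_factory=set)   # expert ids currently cached

    def step(self, router_scores: dict) -> list:
        """Fold one routing step into the scores; return the experts that missed."""
        # Age every known score, then add this step's router scores.
        for expert in self.scores:
            self.scores[expert] *= self.decay
        for expert, score in router_scores.items():
            self.scores[expert] = self.scores.get(expert, 0.0) + score

        misses = []
        for expert in router_scores:
            if expert in self.resident:
                continue
            misses.append(expert)
            if len(self.resident) >= self.capacity:
                victim = min(self.resident, key=lambda e: self.scores[e])
                # Keep the current residents if the newcomer scores no higher.
                if self.scores[victim] >= self.scores[expert]:
                    continue
                self.resident.remove(victim)
            self.resident.add(expert)
        return misses


cache = ScoreCache(capacity=2)
first_misses = cache.step({0: 0.9, 1: 0.1})    # cold cache: both experts miss
second_misses = cache.step({2: 0.6, 0: 0.4})   # expert 2 replaces low-scoring expert 1
```

In a hybrid setting, the experts returned as misses would be the candidates for either CPU execution or a transfer to the GPU; how that choice is made is the job of the scheduling strategy and is outside this sketch.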

