HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

Abstract

The Mixture of Experts (MoE) architecture has demonstrated significant advantages because it increases model capacity without a proportional increase in computation. However, the large size of MoE models still imposes substantial memory demands, which usually requires expert offloading on resource-constrained platforms and incurs significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation to reduce expert loading overhead, but it faces two major challenges. First, the expert activation patterns of MoE models are highly unstable, which makes the fixed mapping strategies of existing work inefficient. Second, hybrid CPU-GPU scheduling for MoE is inherently complex because of diverse expert sizes and structures, uneven workload distribution, and other factors. To address these challenges, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to a state-of-the-art hybrid MoE inference framework. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
