

PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

May 23, 2025
作者: Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
cs.AI

Abstract

Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits to quantify expert importance for specific tasks, thereby identifying a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2% accuracy on MATH500 when pruned to 8/128 configuration (50% expert reduction), and still achieves 72.0% with aggressive 8/32 pruning (87.5% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15% on MATH500 and 81.3% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95% accuracy on MATH500. We make our code publicly available at https://github.com/JarvisPei/PreMoe.
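The abstract outlines two mechanisms: PEP scores experts per task from router logits (the TCESS metric) and keeps only a small critical subset, while TAER matches an incoming query against stored task-specific expert patterns and loads just those experts. The sketch below is a minimal, hypothetical illustration of that flow, not the released implementation: the function names, the exact TCESS aggregation (here, top-k-masked routing probabilities averaged over calibration tokens), and the cosine-similarity pattern matching are assumptions.

```python
import numpy as np

def tcess(router_logits: np.ndarray, top_k: int = 8) -> np.ndarray:
    """Illustrative task-conditioned expected selection score (assumption).

    router_logits: [num_tokens, num_experts] raw router outputs collected
    from calibration prompts of one task at one MoE layer. Here we average
    each token's top-k routing probabilities as a rough proxy for the
    paper's TCESS; the actual formulation may differ.
    """
    # Softmax over experts per token.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # Zero out everything outside each token's top-k routing choices.
    kth = np.partition(probs, -top_k, axis=-1)[:, -top_k][:, None]
    masked = np.where(probs >= kth, probs, 0.0)
    return masked.mean(axis=0)  # [num_experts] importance per expert

def prune_experts(scores: np.ndarray, keep: int) -> np.ndarray:
    """Indices of the `keep` most important experts for this task/layer."""
    return np.argsort(scores)[::-1][:keep]

def match_task_pattern(query_scores: np.ndarray,
                       stored_patterns: dict[str, np.ndarray]) -> str:
    """Pick the stored task whose expert-importance profile is closest
    (cosine similarity) to the profile induced by the incoming query."""
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(stored_patterns, key=lambda t: cos(query_scores, stored_patterns[t]))
```

In a deployment along these lines, the per-task importance profiles would be pre-computed offline for every MoE layer, and the matched pattern would determine which expert weights are loaded into memory before inference.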
