

PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval

May 23, 2025
作者: Zehua Pei, Ying Zhang, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
cs.AI

Abstract

Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs. However, the significant memory demands of large MoE models hinder their deployment across various computational environments, from cloud servers to consumer devices. This study first demonstrates pronounced task-specific specialization in expert activation patterns within MoE layers. Building on this, we introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments. PreMoe features two main components: probabilistic expert pruning (PEP) and task-adaptive expert retrieval (TAER). PEP employs a new metric, the task-conditioned expected selection score (TCESS), derived from router logits to quantify expert importance for specific tasks, thereby identifying a minimal set of critical experts. TAER leverages these task-specific expert importance profiles for efficient inference. It pre-computes and stores compact expert patterns for diverse tasks. When a user query is received, TAER rapidly identifies the most relevant stored task pattern and reconstructs the model by loading only the small subset of experts crucial for that task. This approach dramatically reduces the memory footprint across all deployment scenarios. DeepSeek-R1 671B maintains 97.2% accuracy on MATH500 when pruned to an 8/128 configuration (50% expert reduction), and still achieves 72.0% with aggressive 8/32 pruning (87.5% expert reduction). Pangu-Ultra-MoE 718B achieves 97.15% on MATH500 and 81.3% on AIME24 with 8/128 pruning, while even more aggressive pruning to 4/64 (390GB memory) preserves 96.95% accuracy on MATH500. We make our code publicly available at https://github.com/JarvisPei/PreMoe.
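
Below is a minimal PyTorch sketch (not taken from the released code) of how the two components could fit together, assuming per-layer router logits of shape [num_tokens, num_experts] are collected from representative task samples. The function names tcess_scores, prune_to_top_experts, and retrieve_task_pattern, the top-k masking used to approximate TCESS, and the cosine-similarity matching are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of PEP/TAER-style expert pruning and retrieval.
# Assumption: router logits per MoE layer are available for task samples.
import torch
import torch.nn.functional as F

def tcess_scores(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Approximate a task-conditioned expected selection score per expert.

    router_logits: [num_tokens, num_experts] logits gathered while running
    representative samples of one task through a single MoE layer's router.
    Returns a [num_experts] importance vector (higher = more critical).
    """
    probs = F.softmax(router_logits, dim=-1)          # routing probabilities
    topk_vals, topk_idx = probs.topk(top_k, dim=-1)   # experts each token selects
    mask = torch.zeros_like(probs).scatter_(-1, topk_idx, topk_vals)
    return mask.mean(dim=0)                           # expectation over task tokens

def prune_to_top_experts(scores: torch.Tensor, keep: int) -> torch.Tensor:
    """Indices of the `keep` most important experts for this task and layer."""
    return scores.topk(keep).indices

def retrieve_task_pattern(query_scores: torch.Tensor,
                          stored_patterns: dict[str, torch.Tensor]) -> str:
    """Match a query's importance profile to the closest pre-computed task pattern."""
    best_task, best_sim = None, -float("inf")
    for task, pattern in stored_patterns.items():
        sim = F.cosine_similarity(query_scores, pattern, dim=0).item()
        if sim > best_sim:
            best_task, best_sim = task, sim
    return best_task

if __name__ == "__main__":
    # Toy usage: 128 routed experts, 8 selected per token, pruned to a 32-expert pool.
    logits = torch.randn(1024, 128)
    scores = tcess_scores(logits, top_k=8)
    kept = prune_to_top_experts(scores, keep=32)      # e.g., an 8/32 configuration
    print("experts to load:", kept.tolist())
    patterns = {"math": scores, "code": tcess_scores(torch.randn(1024, 128), 8)}
    print("matched task:", retrieve_task_pattern(tcess_scores(torch.randn(64, 128), 8), patterns))
```

In this reading, the kept-expert count per layer corresponds to configurations such as 8/128 or 8/32 in the abstract, and at serving time only the experts of the retrieved task pattern would be loaded into memory.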
