REAM: Merging Improves Pruning of Experts in LLMs
April 6, 2026
Authors: Saurav Jha, Maryam Hashemzadeh, Ali Saheb Pasand, Ali Parviz, Min-Joong Lee, Boris Knyazev
cs.AI
Abstract
Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reducing memory requirements include weight pruning and quantization. Motivated by Router-weighted Expert Activation Pruning (REAP), which prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving the original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math, and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.
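To make the core idea concrete, below is a minimal, hypothetical sketch of router-weighted expert merging. The abstract only states that REAM groups experts and merges their weights using router weights; the grouping strategy, the weighting scheme, and all function and variable names here (`ream_merge`, `expert_weights`, `router_weights`, round-robin grouping by saliency) are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def ream_merge(expert_weights, router_weights, n_groups):
    """Illustrative sketch (not the paper's algorithm): group E experts into
    n_groups and merge each group's weight tensors, weighting each expert by
    its (calibration-averaged) router weight as a saliency proxy.

    expert_weights: array of shape (E, d_out, d_in), one weight matrix per expert
    router_weights: array of shape (E,), average routing weight per expert
    Returns: list of n_groups merged weight matrices, and the groups themselves.
    """
    # Assumed grouping rule: sort experts by router weight and deal them out
    # round-robin, so each group mixes high- and low-saliency experts.
    order = np.argsort(-router_weights)
    groups = [order[i::n_groups] for i in range(n_groups)]

    merged = []
    for g in groups:
        w = router_weights[g]
        w = w / w.sum()  # normalize router weights within the group
        # Weighted average of the group's expert matrices:
        # contracts w (k,) against the stacked weights (k, d_out, d_in).
        merged.append(np.tensordot(w, expert_weights[g], axes=1))
    return merged, groups
```

Compared with pruning, which discards the low-saliency experts outright, a weighted average of this kind keeps a contribution from every expert in the group, which is one plausible reading of why merging "better preserves original performance".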