
REAM: Merging Improves Pruning of Experts in LLMs

April 6, 2026
Authors: Saurav Jha, Maryam Hashemzadeh, Ali Saheb Pasand, Ali Parviz, Min-Joong Lee, Boris Knyazev
cs.AI

Abstract

Mixture-of-Experts (MoE) large language models (LLMs) are among the top-performing architectures. The largest models, often with hundreds of billions of parameters, pose significant memory challenges for deployment. Traditional approaches to reduce memory requirements include weight pruning and quantization. Motivated by the Router-weighted Expert Activation Pruning (REAP) that prunes experts, we propose a novel method, Router-weighted Expert Activation Merging (REAM). Instead of removing experts, REAM groups them and merges their weights, better preserving original performance. We evaluate REAM against REAP and other baselines across multiple MoE LLMs on diverse multiple-choice (MC) question answering and generative (GEN) benchmarks. Our results reveal a trade-off between MC and GEN performance that depends on the mix of calibration data. By controlling the mix of general, math and coding data, we examine the Pareto frontier of this trade-off and show that REAM often outperforms the baselines and in many cases is comparable to the original uncompressed models.
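The abstract's central idea — grouping similar experts and merging their weights via a router-weighted average, rather than deleting them as in pruning — can be illustrated with a minimal sketch. This is not the paper's implementation: the greedy cosine-similarity grouping and the per-expert `router_scores` saliency used below are assumptions chosen for demonstration only.

```python
import numpy as np

def merge_experts(expert_weights, router_scores, num_groups):
    """Illustrative router-weighted expert merging.

    Greedily groups experts by cosine similarity of their flattened
    weights, then replaces each group with a single expert whose
    weights are the router-score-weighted average of the group.
    Assumes len(expert_weights) is divisible by num_groups.
    """
    n = len(expert_weights)
    flat = np.stack([w.ravel() for w in expert_weights])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = flat @ flat.T
    np.fill_diagonal(sim, -np.inf)  # an expert never pairs with itself

    group_size = n // num_groups
    unassigned = set(range(n))
    merged = []
    while unassigned:
        seed = min(unassigned)  # seed each group with the lowest-index expert
        unassigned.remove(seed)
        group = [seed]
        # greedily pull in the most similar remaining experts
        for _ in range(group_size - 1):
            j = max(unassigned, key=lambda k: sim[seed, k])
            unassigned.remove(j)
            group.append(j)
        # router-weighted average of the group's weight matrices
        scores = np.array([router_scores[g] for g in group], dtype=float)
        scores = scores / scores.sum()
        merged.append(sum(s * expert_weights[g] for s, g in zip(scores, group)))
    return merged

# Toy example: 4 constant-valued experts merged down to 2.
experts = [np.full((2, 2), float(i + 1)) for i in range(4)]
scores = [0.1, 0.4, 0.3, 0.2]  # hypothetical router activation scores
merged = merge_experts(experts, scores, num_groups=2)
print(len(merged))  # 2
```

The compressed layer keeps `num_groups` experts instead of `n`, so memory shrinks proportionally while every original expert still contributes to some merged weight — the property the abstract credits for better preserving the uncompressed model's performance.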
PDF · April 9, 2026