MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

August 7, 2025
Authors: Xiaodong Chen, Mingming Ha, Zhenzhong Lan, Jing Zhang, Jianguo Li
cs.AI

Abstract

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs such as DeepSeek-V3-0324 and Kimi-K2-Instruct present serious deployment challenges due to their substantial memory requirements. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7-14% relative) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression while incurring minimal accuracy loss. Specifically, each up/gate matrix in an expert is factored via a rank decomposition W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops than prior works. For instance, MoBE reduces the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B), and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% accuracy drop (roughly 2% relative).
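
To make the described decomposition concrete, the sketch below fits W_e ≈ A_e (Σ_i α_{e,i} B_i) for the up/gate matrices of one MoE layer by minimizing the Frobenius reconstruction error. This is a minimal illustration of the factorization named in the abstract, not the authors' implementation: the tensor shapes, variable names (num_experts, rank, num_bases, alpha), and the Adam fitting loop are assumptions for illustration only.

```python
# Minimal MoBE-style factorization sketch (assumed setup, not the paper's code).
import torch

num_experts, d_out, d_in = 8, 256, 128   # toy layer sizes (assumed)
rank, num_bases = 32, 4                  # low rank and number of shared bases (assumed)

# Stand-in for the original up/gate matrices of each expert in one MoE layer.
W = torch.randn(num_experts, d_out, d_in)

# Learnable factors: A_e is unique to each expert; the bases {B_i} are shared
# across experts, and alpha_e mixes them into that expert's B_e.
A = torch.randn(num_experts, d_out, rank, requires_grad=True)
bases = torch.randn(num_bases, rank, d_in, requires_grad=True)
alpha = torch.randn(num_experts, num_bases, requires_grad=True)

opt = torch.optim.Adam([A, bases, alpha], lr=1e-2)
for step in range(2000):
    # B_e = sum_i alpha_{e,i} B_i  ->  shape (num_experts, rank, d_in)
    B = torch.einsum("ek,krd->erd", alpha, bases)
    # Reconstruction A_e @ B_e  ->  shape (num_experts, d_out, d_in)
    W_hat = torch.einsum("eor,erd->eod", A, B)
    loss = ((W_hat - W) ** 2).mean()  # Frobenius reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final reconstruction MSE: {loss.item():.4f}")
```

The parameter saving comes from sharing: each expert stores only its small A_e and the mixing weights alpha_e, while the larger B factor is replaced by a handful of bases shared across all experts in the layer.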