
MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

August 7, 2025
Authors: Xiaodong Chen, Mingming Ha, Zhenzhong Lan, Jing Zhang, Jianguo Li
cs.AI

Abstract

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs such as DeepSeek-V3-0324 and Kimi-K2-Instruct pose serious deployment challenges due to their substantial memory requirements. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7%-14% relative) even at modest compression rates. This paper introduces Mixture-of-Basis-Experts (MoBE), a novel method that achieves model compression while incurring minimal accuracy loss. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error with respect to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops than prior works. For instance, MoBE reduces the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B), and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% absolute accuracy drop (about a 2% drop when measured relatively).
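To make the described factorization concrete, below is a minimal, self-contained PyTorch sketch of the idea in the abstract: each expert's up/gate matrix W_e is approximated as A_e times a per-expert B_e, where B_e is a linear combination of basis matrices {Bi} shared across the layer, and the factors are fit by minimizing the reconstruction error. The dimensions, the number of basis matrices, the synthetic weights, and the plain Adam loop on a Frobenius-norm loss are illustrative assumptions, not the paper's actual models or training recipe.

```python
# Minimal sketch of a MoBE-style factorization on synthetic expert weights.
# Assumptions (not from the paper): shapes, hyperparameters, and optimizer.
import torch

torch.manual_seed(0)

d_in, d_out = 256, 512                 # each expert up/gate matrix is d_out x d_in
num_experts, rank, num_basis = 8, 64, 4

# Synthetic stand-ins for the original expert matrices W_e.
W = torch.randn(num_experts, d_out, d_in)

# A_e (d_out x rank): unique to each expert.
A = torch.nn.Parameter(0.02 * torch.randn(num_experts, d_out, rank))
# Shared basis matrices {B_i} (rank x d_in), common to all experts in the layer.
basis = torch.nn.Parameter(0.02 * torch.randn(num_basis, rank, d_in))
# Mixing coefficients: each expert's B_e is a linear combination of the bases.
coeff = torch.nn.Parameter(0.02 * torch.randn(num_experts, num_basis))

opt = torch.optim.Adam([A, basis, coeff], lr=1e-2)
for step in range(2000):
    # B_e = sum_i coeff[e, i] * basis[i]  -> (num_experts, rank, d_in)
    B = torch.einsum("ek,krd->erd", coeff, basis)
    # Reconstruction W_hat_e = A_e @ B_e   -> (num_experts, d_out, d_in)
    W_hat = torch.einsum("eor,erd->eod", A, B)
    loss = ((W_hat - W) ** 2).mean()      # mean squared reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final reconstruction MSE: {loss.item():.6f}")
```

The parameter saving comes from the shared factors: instead of storing num_experts full d_out x d_in matrices, the layer stores one small A_e and one coefficient vector per expert plus a handful of shared basis matrices.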