MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Abstract

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Although they offer strong performance and computational efficiency, large MoE-based LLMs such as DeepSeek-V3-0324 and Kimi-K2-Instruct are difficult to deploy because of their substantial memory requirements. Recent work has explored MoE compression to address this issue, but existing methods often suffer considerable accuracy drops (e.g., 7-14% in relative terms) even at modest compression rates. This paper introduces Mixture-of-Basis-Experts (MoBE), a novel method that compresses the model while incurring minimal accuracy drops. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {B_i} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops than prior work. For instance, MoBE reduces the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B), and Kimi-K2-Instruct (1T) by 24%-30% with an absolute accuracy drop of only 1%-2% (about 2% when measured relatively).
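The factorization described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The tensor shapes, the names `mobe_reconstruct`, `reconstruction_error`, and `param_counts`, the coefficient array `alpha`, and the toy dimensions are all assumptions made for illustration; the abstract only states that W = AB, that B is a linear combination of shared basis matrices, and that the factors are fit by minimizing reconstruction error. The sketch evaluates that objective and counts parameters; it does not include an optimizer.

```python
import numpy as np

# Hypothetical shapes (assumptions, not taken from the paper):
#   n experts, each weight W_k of shape (p, d)
#   A:     (n, p, r)  one factor per expert
#   bases: (m, r, d)  m basis matrices shared by all experts in the layer
#   alpha: (n, m)     per-expert mixing coefficients


def mobe_reconstruct(A, alpha, bases):
    """Rebuild every expert weight as W_k = A_k @ (sum_j alpha[k, j] * bases[j])."""
    # Per-expert B as a linear combination of the shared bases: (n, r, d)
    B = np.einsum("nm,mrd->nrd", alpha, bases)
    # Batched matrix product: (n, p, r) @ (n, r, d) -> (n, p, d)
    return A @ B


def reconstruction_error(W, A, alpha, bases):
    """Sum of squared differences between original and reconstructed weights."""
    diff = W - mobe_reconstruct(A, alpha, bases)
    return float(np.sum(diff ** 2))


def param_counts(n, p, d, r, m):
    """Parameter counts before and after factorization, under the shapes above."""
    original = n * p * d
    compressed = n * p * r + m * r * d + n * m
    return original, compressed


# Toy example: weights that are exactly representable, so the error is near zero.
rng = np.random.default_rng(0)
n, p, d, r, m = 8, 6, 10, 4, 2
A = rng.standard_normal((n, p, r))
alpha = rng.standard_normal((n, m))
bases = rng.standard_normal((m, r, d))
W = mobe_reconstruct(A, alpha, bases)

print(reconstruction_error(W, A, alpha, bases))
print(param_counts(n, p, d, r, m))
```

In this toy setting the shared bases are stored once per layer, so their cost does not grow with the number of experts. Real weight matrices would not be exactly representable, and the factors would have to be fit by an optimizer of the reader's choice.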
