MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

Abstract

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Although MoE offers strong performance and computational efficiency, large MoE-based LLMs such as DeepSeek-V3-0324 and Kimi-K2-Instruct are hard to deploy because of their substantial memory requirements. Recent work has explored MoE compression to address this issue, but existing methods often suffer considerable accuracy drops (e.g., 7-14% relative) even at modest compression rates. This paper introduces Mixture-of-Basis-Experts (MoBE), a novel method that compresses the model while incurring only minimal accuracy drops. Specifically, each up/gate matrix in an expert is factorized via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further re-parameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably lower accuracy drops than prior work. For instance, MoBE reduces the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B), and Kimi-K2-Instruct (1T) by 24%-30% with an accuracy drop of only 1%-2% (about 2% in relative terms).
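The factorization described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The shapes (W is p x d, A is p x r, each basis is r x d), the per-expert mixing coefficients `alpha_i`, and all function names are assumptions made for illustration; the abstract only states that W = AB and that B is a linear combination of shared bases. Plain Python lists are used instead of a tensor library to keep the example self-contained.

```python
def matmul(X, Y):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def mix_bases(alpha_i, bases):
    """Assumed form of B for one expert: B = sum_j alpha_i[j] * bases[j]."""
    rows, cols = len(bases[0]), len(bases[0][0])
    return [[sum(a * basis[r][c] for a, basis in zip(alpha_i, bases))
             for c in range(cols)] for r in range(rows)]

def reconstruct(A_i, alpha_i, bases):
    """Rebuild one expert's matrix as W_hat = A_i @ B, with B mixed from shared bases."""
    return matmul(A_i, mix_bases(alpha_i, bases))

def reconstruction_error(W, W_hat):
    """Squared Frobenius distance between the original and rebuilt matrices."""
    return sum((w - v) ** 2 for rw, rv in zip(W, W_hat) for w, v in zip(rw, rv))

def param_counts(num_experts, p, d, r, num_bases):
    """Parameters for one matrix type in one MoE layer: dense vs. factorized (assumed shapes)."""
    dense = num_experts * p * d
    factorized = (num_experts * p * r        # one A per expert
                  + num_experts * num_bases  # mixing coefficients per expert
                  + num_bases * r * d)       # bases shared by all experts
    return dense, factorized

# Toy expert: p = 2, r = 2, d = 3, two shared bases (hypothetical values).
A_i = [[1, 0], [0, 2]]
bases = [[[1, 2, 3], [4, 5, 6]], [[1, 0, 0], [0, 1, 0]]]
alpha_i = [2, -1]
W_hat = reconstruct(A_i, alpha_i, bases)  # [[1, 4, 6], [16, 18, 24]]
```

Under these assumed shapes, the saving comes from the shared term: the bases are stored once per layer, so the per-expert cost is only A plus a handful of coefficients. Learning the factorization would then amount to adjusting A, the coefficients, and the bases to drive `reconstruction_error` toward zero for every expert.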
