SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
June 23, 2025
Authors: Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, Tuo Zhao
cs.AI
Abstract
The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm
for scaling large language models (LLMs) while maintaining inference
efficiency. However, the enormous memory requirements of MoE models make them prohibitively
expensive to fine-tune or deploy in resource-constrained environments. To
address this challenge, we introduce SlimMoE, a multi-stage compression
framework for transforming large MoE models into much smaller, efficient
variants without incurring the prohibitive costs of training from scratch. Our
method systematically reduces parameter counts by slimming experts and
transferring knowledge through intermediate stages, effectively mitigating the
performance degradation common in one-shot pruning approaches. Using this
framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to
create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE
(3.8B total/1.1B activated parameters) using only 400B tokens (less than 10% of
the original model's training data). These compressed models can be fine-tuned
on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them
highly suitable for academic and resource-limited settings. Our experiments
demonstrate that these compressed models outperform others of similar size and
remain competitive with larger models. For instance, Phi-mini-MoE achieves
performance similar to or better than Phi-3-mini while using only 2/3 of its
activated parameters, and attains MMLU scores comparable to Llama 3.1 8B with
significantly lower latency. Our findings demonstrate that structured pruning
combined with staged distillation offers an effective path to creating
high-quality, compact MoE models, paving the way for broader adoption of MoE
architectures. We make our models publicly available at
https://huggingface.co/microsoft/Phi-mini-MoE-instruct and
https://huggingface.co/microsoft/Phi-tiny-MoE-instruct.
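
The abstract describes the pipeline only at a high level. As a concrete illustration, the following is a minimal toy sketch of what staged expert slimming followed by distillation can look like, written against a tiny stand-in MoE layer. The SimpleMoE and Expert classes, the output-norm importance score in slim_expert, the MSE distillation objective, and the two-stage width schedule are all assumptions made for illustration; they are not the paper's actual models, criteria, or training recipe. The structural point the abstract emphasizes is preserved: each intermediate model becomes the teacher for the next, smaller stage instead of pruning to the final size in one shot.

```python
# Toy sketch of staged expert slimming + distillation (illustrative assumptions only).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A standard two-layer feed-forward expert."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class SimpleMoE(nn.Module):
    """A toy top-k MoE layer standing in for a full transformer MoE block."""
    def __init__(self, d_model=64, d_ff=256, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):
        scores = self.router(x).softmax(dim=-1)                  # (batch, n_experts)
        topv, topi = scores.topk(self.top_k, dim=-1)             # routing weights and expert ids
        # Dense formulation for clarity: run every expert, then zero out unrouted ones.
        gate = torch.zeros_like(scores).scatter(-1, topi, topv)  # (batch, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, d_model)
        return (gate.unsqueeze(-1) * expert_outs).sum(dim=1)

def slim_expert(expert, keep):
    """Structured slimming: keep the `keep` hidden units with the largest output-weight norms."""
    importance = expert.down.weight.norm(dim=0)                  # one score per hidden unit (assumed criterion)
    idx = importance.topk(keep).indices.sort().values
    slim = Expert(expert.up.in_features, keep)
    slim.up.weight.data = expert.up.weight.data[idx].clone()
    slim.up.bias.data = expert.up.bias.data[idx].clone()
    slim.down.weight.data = expert.down.weight.data[:, idx].clone()
    slim.down.bias.data = expert.down.bias.data.clone()
    return slim

def distill_step(teacher, student, x, optimizer):
    """One distillation step: match the student's outputs to the frozen teacher's."""
    with torch.no_grad():
        target = teacher(x)
    loss = F.mse_loss(student(x), target)                        # toy objective; the paper's losses differ
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Multi-stage compression: slim, distill, then reuse the result as the next teacher,
# rather than jumping to the smallest width in a single shot.
teacher = SimpleMoE(d_ff=256)
for stage_d_ff in (128, 64):                                     # assumed two-stage width schedule
    student = copy.deepcopy(teacher)
    student.experts = nn.ModuleList([slim_expert(e, stage_d_ff) for e in teacher.experts])
    opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
    for _ in range(100):                                         # a handful of toy distillation steps
        distill_step(teacher, student, torch.randn(32, 64), opt)
    teacher = student                                            # intermediate model becomes the new teacher
```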
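
Since the compressed models are released as Hugging Face Hub repositories, loading them with the transformers library should follow the usual pattern. The snippet below is a usage sketch only; details such as whether trust_remote_code is needed, the minimum transformers version, and the exact prompt or chat formatting are assumptions rather than documented requirements, so the model cards remain the authoritative reference.

```python
# Usage sketch: loading the released Phi-mini-MoE-instruct checkpoint with
# Hugging Face transformers (device_map="auto" additionally requires `accelerate`).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-mini-MoE-instruct"  # or "microsoft/Phi-tiny-MoE-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",       # place the model on the available GPU(s)
    trust_remote_code=True,  # assumption: the repo may ship custom MoE modeling code
)

# Plain-text prompting shown for brevity; an instruct model may expect a chat template.
prompt = "Summarize what a Mixture of Experts model is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```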