SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
June 23, 2025
Authors: Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, Tuo Zhao
cs.AI
Abstract
The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm
for scaling large language models (LLMs) while maintaining inference
efficiency. However, the enormous memory requirements of MoE models make them prohibitively
expensive to fine-tune or deploy in resource-constrained environments. To
address this challenge, we introduce SlimMoE, a multi-stage compression
framework for transforming large MoE models into much smaller, efficient
variants without incurring the prohibitive costs of training from scratch. Our
method systematically reduces parameter counts by slimming experts and
transferring knowledge through intermediate stages, effectively mitigating the
performance degradation common in one-shot pruning approaches. Using this
framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to
create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE
(3.8B total/1.1B activated parameters) using only 400B tokens (less than 10% of
the original model's training data). These compressed models can be fine-tuned
on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them
highly suitable for academic and resource-limited settings. Our experiments
demonstrate that these compressed models outperform others of similar size and
remain competitive with larger models. For instance, Phi-mini-MoE achieves
performance similar to or better than Phi-3-mini while using only 2/3 of its
activated parameters, and attains MMLU scores comparable to Llama 3.1 8B with
significantly lower latency. Our findings demonstrate that structured pruning
combined with staged distillation offers an effective path to creating
high-quality, compact MoE models, paving the way for broader adoption of MoE
architectures. We make our models publicly available at
https://huggingface.co/microsoft/Phi-mini-MoE-instruct and
https://huggingface.co/microsoft/Phi-tiny-MoE-instruct.
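
The abstract describes the pipeline only at a high level. As a concrete illustration, the following is a minimal toy sketch of what staged expert slimming followed by distillation can look like, written against a tiny stand-in MoE layer. The SimpleMoE and Expert classes, the output-norm importance score in slim_expert, the MSE distillation objective, and the two-stage width schedule are all assumptions made for illustration; they are not the paper's actual models, criteria, or training recipe. The structural point the abstract emphasizes is preserved: each intermediate model becomes the teacher for the next, smaller stage instead of pruning to the final size in one shot.

```python
# Toy sketch of staged expert slimming + distillation (illustrative assumptions only).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """A standard two-layer feed-forward expert."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class SimpleMoE(nn.Module):
    """A toy top-k MoE layer standing in for a full transformer MoE block."""
    def __init__(self, d_model=64, d_ff=256, n_experts=4, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([Expert(d_model, d_ff) for _ in range(n_experts)])
        self.top_k = top_k

    def forward(self, x):
        scores = self.router(x).softmax(dim=-1)                  # (batch, n_experts)
        topv, topi = scores.topk(self.top_k, dim=-1)             # routing weights and expert ids
        # Dense formulation for clarity: run every expert, then zero out unrouted ones.
        gate = torch.zeros_like(scores).scatter(-1, topi, topv)  # (batch, n_experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, n_experts, d_model)
        return (gate.unsqueeze(-1) * expert_outs).sum(dim=1)

def slim_expert(expert, keep):
    """Structured slimming: keep the `keep` hidden units with the largest output-weight norms."""
    importance = expert.down.weight.norm(dim=0)                  # one score per hidden unit (assumed criterion)
    idx = importance.topk(keep).indices.sort().values
    slim = Expert(expert.up.in_features, keep)
    slim.up.weight.data = expert.up.weight.data[idx].clone()
    slim.up.bias.data = expert.up.bias.data[idx].clone()
    slim.down.weight.data = expert.down.weight.data[:, idx].clone()
    slim.down.bias.data = expert.down.bias.data.clone()
    return slim

def distill_step(teacher, student, x, optimizer):
    """One distillation step: match the student's outputs to the frozen teacher's."""
    with torch.no_grad():
        target = teacher(x)
    loss = F.mse_loss(student(x), target)                        # toy objective; the paper's losses differ
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Multi-stage compression: slim, distill, then reuse the result as the next teacher,
# rather than jumping to the smallest width in a single shot.
teacher = SimpleMoE(d_ff=256)
for stage_d_ff in (128, 64):                                     # assumed two-stage width schedule
    student = copy.deepcopy(teacher)
    student.experts = nn.ModuleList([slim_expert(e, stage_d_ff) for e in teacher.experts])
    opt = torch.optim.AdamW(student.parameters(), lr=1e-3)
    for _ in range(100):                                         # a handful of toy distillation steps
        distill_step(teacher, student, torch.randn(32, 64), opt)
    teacher = student                                            # intermediate model becomes the new teacher
```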
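
Since the compressed models are released as Hugging Face Hub repositories, loading them with the transformers library should follow the usual pattern. The snippet below is a usage sketch only; details such as whether trust_remote_code is needed, the minimum transformers version, and the exact prompt or chat formatting are assumptions rather than documented requirements, so the model cards remain the authoritative reference.

```python
# Usage sketch: loading the released Phi-mini-MoE-instruct checkpoint with
# Hugging Face transformers (device_map="auto" additionally requires `accelerate`).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-mini-MoE-instruct"  # or "microsoft/Phi-tiny-MoE-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",       # place the model on the available GPU(s)
    trust_remote_code=True,  # assumption: the repo may ship custom MoE modeling code
)

# Plain-text prompting shown for brevity; an instruct model may expect a chat template.
prompt = "Summarize what a Mixture of Experts model is in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```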