Pruning and Distilling Mixture-of-Experts Models into Dense Language Models

Abstract

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, but it requires all expert parameters to be resident in memory, which makes it a poor fit for memory-constrained deployment. Existing compression methods reduce the number of experts, but their output is still an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method has the largest impact, and that our novel diversity-aware scoring consistently outperforms prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. In a controlled comparison at matched parameter count, MoE-to-dense conversion outperforms dense-to-dense pruning by 6.3 percentage points in average downstream accuracy after ~4B tokens of distillation, while training 1.6x faster in wall-clock time.
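To make the "concatenated into a dense FFN" step concrete, here is a minimal pure-Python sketch. It assumes each expert is a SwiGLU-style gated FFN with `gate`, `up`, and `down` weight matrices; the function names, the dictionary layout, and the toy dimensions are all hypothetical and are not taken from the paper. It only illustrates the general idea that stacking experts along the hidden axis yields one wider dense FFN whose output is the unweighted sum of the selected experts' outputs.

```python
import math
import random

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_ffn(x, W_gate, W_up, W_down):
    # Assumed expert form: W_down @ (silu(W_gate @ x) * (W_up @ x)).
    hidden = [silu(g) * u for g, u in zip(matvec(W_gate, x), matvec(W_up, x))]
    return matvec(W_down, hidden)

def concat_experts(experts):
    # gate/up have shape (d_ff, d_model): stack rows -> (k * d_ff, d_model).
    W_gate = [row for e in experts for row in e["gate"]]
    W_up = [row for e in experts for row in e["up"]]
    # down has shape (d_model, d_ff): join columns -> (d_model, k * d_ff).
    d_model = len(experts[0]["down"])
    W_down = [[v for e in experts for v in e["down"][r]] for r in range(d_model)]
    return W_gate, W_up, W_down

def random_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy sizes (hypothetical): 3 selected experts, d_model = 4, d_ff = 6.
rng = random.Random(0)
d_model, d_ff, k = 4, 6, 3
experts = [
    {
        "gate": random_matrix(rng, d_ff, d_model),
        "up": random_matrix(rng, d_ff, d_model),
        "down": random_matrix(rng, d_model, d_ff),
    }
    for _ in range(k)
]
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

W_gate, W_up, W_down = concat_experts(experts)
dense_out = gated_ffn(x, W_gate, W_up, W_down)

# Reference: run each expert separately and add the outputs.
summed = [0.0] * d_model
for e in experts:
    out = gated_ffn(x, e["gate"], e["up"], e["down"])
    summed = [s + o for s, o in zip(summed, out)]

print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(dense_out, summed)))
```

Because the dense FFN has no router, every concatenated expert contributes on every token with weight 1, so the output magnitude differs from the MoE layer's routed, weighted sum. Reading the abstract's magnitude scaling as a correction for this mismatch is an assumption; the sketch applies no rescaling.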