Pruning and Distilling Mixture-of-Experts into Dense Language Models
May 27, 2026
Authors: Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho
cs.AI
Abstract
Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, which makes it less suitable for memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense feed-forward network (FFN) and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude-scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method has the largest impact, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. In a controlled comparison at matched parameter count, MoE-to-dense conversion outperforms dense-to-dense pruning by 6.3 percentage points in average downstream accuracy after distillation on roughly 4B tokens, while training 1.6x faster in wall-clock time.
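To make the "select, then concatenate" step concrete, the following is a minimal sketch and not the authors' implementation. It assumes each expert is a gated (SwiGLU-style) FFN with `gate`, `up`, and `down` matrices, uses random placeholder scores in place of the paper's scoring methods (the abstract does not specify diversity-aware scoring), and skips grouping entirely. All names (`concat_experts`, `swiglu_ffn`, `scale`) are hypothetical. The sketch checks one fact that follows directly from the algebra: stacking experts along the hidden dimension gives a dense FFN whose output equals the unweighted sum of the selected experts' outputs.

```python
import math
import random

def silu(v):
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_ffn(x, ffn):
    """Gated FFN (assumed expert form): down(silu(gate x) * (up x))."""
    gate_act = matvec(ffn["gate"], x)
    up_act = matvec(ffn["up"], x)
    hidden = [silu(g) * u for g, u in zip(gate_act, up_act)]
    return matvec(ffn["down"], hidden)

def concat_experts(experts, selected, scale=1.0):
    """Stack the selected experts along the hidden dimension.

    gate/up matrices gain rows; the down matrix gains columns.
    `scale` is a hypothetical stand-in for a magnitude-scaling step.
    """
    gate = [row for i in selected for row in experts[i]["gate"]]
    up = [row for i in selected for row in experts[i]["up"]]
    n_out = len(experts[selected[0]]["down"])
    down = [
        [scale * v for i in selected for v in experts[i]["down"][r]]
        for r in range(n_out)
    ]
    return {"gate": gate, "up": up, "down": down}

rng = random.Random(0)
d_model, d_expert, n_experts, k = 8, 4, 6, 3  # toy sizes, chosen arbitrarily

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

experts = [
    {
        "gate": rand_matrix(d_expert, d_model),
        "up": rand_matrix(d_expert, d_model),
        "down": rand_matrix(d_model, d_expert),
    }
    for _ in range(n_experts)
]

# Placeholder scores (assumption): a real pipeline would compute these
# from calibration data with one of the scoring methods under study.
scores = [rng.random() for _ in range(n_experts)]
ranked = sorted(range(n_experts), key=lambda i: scores[i], reverse=True)
selected = sorted(ranked[:k])

dense = concat_experts(experts, selected)

# The dense FFN should reproduce the sum of the selected experts' outputs.
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
dense_out = swiglu_ffn(x, dense)
per_expert = [swiglu_ffn(x, experts[i]) for i in selected]
summed_out = [sum(vals) for vals in zip(*per_expert)]
max_gap = max(abs(a - b) for a, b in zip(dense_out, summed_out))
assert max_gap < 1e-9
```

Because the down projection is linear, passing `scale=1.0 / k` would turn the sum into an average of the selected experts. Either way, the concatenated network drops the router's per-token weighting, which is presumably part of what the distillation stage has to recover.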