Pruning and Distilling Mixture-of-Experts into Dense Language Models
May 27, 2026
Authors: Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho
cs.AI
Abstract
Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, which makes it poorly suited to memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense feed-forward network (FFN) and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude-scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method has the largest impact, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. In a controlled comparison at matched parameter count, MoE-to-dense conversion outperforms dense-to-dense pruning by 6.3 pp in average downstream accuracy after ~4B-token distillation, while training 1.6x faster in wall-clock time.
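
The concatenation step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes SwiGLU-style experts with gate, up, and down projections, uses random placeholder scores in place of any real scoring method, skips grouping, and applies a uniform 1/k scale as a stand-in for magnitude scaling. All names (`concat_experts`, `expert_forward`, the toy dimensions) are hypothetical. The point it shows is structural: stacking the selected experts' gate and up projections along the hidden dimension, and their down projections along the input dimension, gives a single dense FFN whose output equals the (scaled) sum of the selected experts' outputs.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def expert_forward(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward block: (silu(x W_gate) * (x W_up)) W_down
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def concat_experts(experts, selected, scale=1.0):
    # Stack gate/up projections along the hidden axis and down
    # projections along the input axis to form one dense FFN.
    w_gate = np.concatenate([experts[i][0] for i in selected], axis=1)
    w_up = np.concatenate([experts[i][1] for i in selected], axis=1)
    w_down = scale * np.concatenate([experts[i][2] for i in selected], axis=0)
    return w_gate, w_up, w_down

rng = np.random.default_rng(0)
d_model, d_expert, n_experts, k = 8, 4, 6, 3  # toy sizes (assumed)

# Each expert is a (W_gate, W_up, W_down) triple with random weights.
experts = [
    (
        rng.normal(size=(d_model, d_expert)),
        rng.normal(size=(d_model, d_expert)),
        rng.normal(size=(d_expert, d_model)),
    )
    for _ in range(n_experts)
]

# Placeholder scores; a real pipeline would compute these from the model.
scores = rng.random(n_experts)
selected = np.argsort(scores)[-k:]  # keep the k highest-scoring experts

# Uniform 1/k scaling is an assumption standing in for magnitude scaling.
dense = concat_experts(experts, selected, scale=1.0 / k)

x = rng.normal(size=(5, d_model))
dense_out = expert_forward(x, *dense)
mean_out = sum(expert_forward(x, *experts[i]) for i in selected) / k
```

Here `dense_out` and `mean_out` agree up to floating-point error, since the elementwise gating acts independently on each expert's slice of the hidden dimension. In the framework described above, a dense FFN built this way would then be refined by knowledge distillation from the MoE teacher, which this sketch does not cover.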