Pruning and Distilling Mixture-of-Experts Models into Dense Language Models

Abstract


Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, which makes it less suitable for memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense feed-forward network (FFN) and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude-scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method has the largest impact, and our novel diversity-aware scoring consistently outperforms prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. In a controlled comparison at matched parameter count, MoE-to-dense conversion outperforms dense-to-dense pruning by 6.3 pp in average downstream accuracy after ~4B tokens of distillation, while training 1.6x faster in wall-clock time.
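The score, select, and concatenate steps can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each expert is a SwiGLU FFN, uses placeholder importance scores in place of the scoring methods studied in the paper, omits grouping, and treats `scale` as a hypothetical stand-in for magnitude scaling. All function names are illustrative. The sketch relies on one exact fact: stacking the selected experts along the intermediate dimension yields a dense FFN whose output equals the (scaled) sum of those experts' outputs.

```python
import math
import random

def silu(x):
    return x / (1.0 + math.exp(-x))

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(ffn, x):
    # gate, up: [d_ff][d_model]; down: [d_model][d_ff]
    g = matvec(ffn["gate"], x)
    u = matvec(ffn["up"], x)
    h = [silu(a) * b for a, b in zip(g, u)]
    return matvec(ffn["down"], h)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def make_expert(d_model, d_ff, rng):
    # Hypothetical helper: a randomly initialised expert for demonstration.
    return {
        "gate": rand_matrix(d_ff, d_model, rng),
        "up": rand_matrix(d_ff, d_model, rng),
        "down": rand_matrix(d_model, d_ff, rng),
    }

def select_top_k(scores, k):
    # Keep the k highest-scoring experts (scores are placeholders here).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def concat_experts(experts, scale=1.0):
    # Stack gate/up rows and concatenate down-projection columns, so the
    # dense FFN has intermediate size k * d_ff.
    d_model = len(experts[0]["down"])
    return {
        "gate": [row for e in experts for row in e["gate"]],
        "up": [row for e in experts for row in e["up"]],
        "down": [
            [scale * w for e in experts for w in e["down"][i]]
            for i in range(d_model)
        ],
    }

if __name__ == "__main__":
    rng = random.Random(0)
    experts = [make_expert(d_model=4, d_ff=3, rng=rng) for _ in range(5)]
    scores = [0.1, 0.9, 0.3, 0.7, 0.2]  # placeholder importance scores
    keep = select_top_k(scores, k=2)
    dense = concat_experts([experts[i] for i in keep], scale=0.5)
    x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
    print("kept experts:", keep)
    print("dense FFN output:", swiglu_ffn(dense, x))
```

Because the concatenated FFN drops the router, its initial output is an unweighted sum rather than the gate-weighted mixture the MoE computes, which is one plausible reason a rescaling step and distillation from the MoE teacher are needed afterwards.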