DOT-MoE: Differentiable Optimal Transport for MoEfication

Abstract

Scaling Large Language Models (LLMs) has driven significant performance gains but has also created substantial challenges for inference efficiency. Mixture-of-Experts (MoE) architectures address this by decoupling model size from inference cost, but training MoEs from scratch is often unstable and compute-intensive. Converting pre-trained dense models into sparse MoEs has emerged as an alternative; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of relying on static heuristics, we model neuron assignment as a balanced transport problem and use differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we use a Straight-Through Estimator (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
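
To make the balanced-transport idea concrete, the following is a minimal sketch of Sinkhorn-Knopp scaling applied to a neuron-to-expert score matrix. It is an illustration under stated assumptions, not the paper's implementation. The score matrix, the temperature `tau`, the iteration count, and the greedy rounding in `harden` are all hypothetical choices; the abstract does not specify how scores are produced or how the soft plan is discretised. In an autodiff framework, a straight-through estimator would typically use the hard assignment in the forward pass and route gradients through the soft plan; that step is omitted here because plain Python has no automatic differentiation.

```python
import math
import random


def sinkhorn(scores, n_iters=50, tau=0.5):
    """Entropy-regularised balanced transport via Sinkhorn-Knopp scaling.

    scores[i][j] is a hypothetical affinity of neuron i for expert j.
    Returns a plan whose rows sum to 1 (each neuron carries one unit of
    mass) and whose columns sum to n_neurons / n_experts (expert capacity).
    """
    n, m = len(scores), len(scores[0])
    capacity = n / m
    # Gibbs kernel; subtracting the row maximum keeps exp() well-scaled
    # and is absorbed by the row scaling factors u.
    kernel = [[math.exp((s - max(row)) / tau) for s in row] for row in scores]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Row step: rescale so every row of diag(u) K diag(v) sums to 1.
        u = [1.0 / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Column step: rescale so every column sums to the capacity.
        v = [capacity / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def harden(plan):
    """Greedily round a soft plan to a capacity-respecting hard assignment.

    Assumes the number of neurons is divisible by the number of experts.
    This rounding rule is an assumption made for illustration only.
    """
    n, m = len(plan), len(plan[0])
    capacity = n // m
    # Visit (neuron, expert) pairs from the largest transported mass down.
    pairs = sorted(
        ((plan[i][j], i, j) for i in range(n) for j in range(m)), reverse=True
    )
    assignment = [-1] * n
    load = [0] * m
    for _, i, j in pairs:
        if assignment[i] == -1 and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment


# Toy example: 8 neurons, 4 experts, random scores in [0, 1).
rng = random.Random(0)
scores = [[rng.random() for _ in range(4)] for _ in range(8)]
plan = sinkhorn(scores)
assignment = harden(plan)
```

In this toy setting each expert ends up with exactly two neurons, which mirrors the strict capacity constraint described above. A lower `tau` would make the soft plan closer to a hard assignment, at the cost of slower convergence of the scaling iterations.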