DOT-MoE: Differentiable Optimal Transport for MoEfication

Abstract

The scaling of Large Language Models (LLMs) has driven significant performance gains but has also created substantial challenges in inference efficiency. Mixture-of-Experts (MoE) architectures address this by decoupling model size from inference cost, but training MoE models from scratch is often unstable and compute-intensive. Converting pre-trained dense models into sparse MoE models has emerged as an alternative; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of relying on static heuristics, we model neuron assignment as a balanced transport problem and use differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we use Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.
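
As a rough illustration of the balanced-transport idea described in the abstract, the sketch below runs plain Sinkhorn-Knopp normalisation on a small, made-up neuron-to-expert score matrix. It is not the authors' implementation: the function name `sinkhorn_balanced`, the `scores` values, and the `epsilon` and `n_iters` settings are all assumptions chosen for illustration. The abstract does not specify the cost matrix, the temperature, or how the soft plan is rounded, so the sketch stops at the soft plan and leaves out the discrete assignment and the straight-through gradient step.

```python
import math


def sinkhorn_balanced(scores, epsilon=0.5, n_iters=200):
    """Soft neuron-to-expert plan from Sinkhorn-Knopp iterations.

    scores[i][j] is a hypothetical affinity of neuron i for expert j.
    The returned plan has rows summing to 1 (each neuron carries unit
    mass) and columns summing to n_neurons / n_experts (equal capacity).
    """
    n, e = len(scores), len(scores[0])
    capacity = n / e

    # Gibbs kernel exp(score / epsilon); subtracting the row maximum only
    # rescales each row, which the row normalisation below absorbs.
    plan = []
    for row in scores:
        row_max = max(row)
        plan.append([math.exp((s - row_max) / epsilon) for s in row])

    for _ in range(n_iters):
        # Column step: rescale every expert column to the shared capacity.
        col_sums = [sum(plan[i][j] for i in range(n)) for j in range(e)]
        plan = [[plan[i][j] * capacity / col_sums[j] for j in range(e)]
                for i in range(n)]
        # Row step: rescale every neuron row to unit mass.
        row_sums = [sum(row) for row in plan]
        plan = [[plan[i][j] / row_sums[i] for j in range(e)]
                for i in range(n)]
    return plan


# Illustrative input (assumed): four neurons that all prefer expert 0.
scores = [[2.0, 0.0], [1.5, 0.0], [1.0, 0.0], [0.5, 0.0]]
plan = sinkhorn_balanced(scores)
column_mass = [sum(row[j] for row in plan) for j in range(2)]
```

Although every neuron in this toy input prefers expert 0, the capacity constraint pushes half of the total mass onto expert 1, so `column_mass` ends up close to the capacity of 2 for both experts. Each step is a multiplication or division by a positive quantity, so the same loop written in an automatic-differentiation framework would let gradients flow back to the scores.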