
CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition

May 19, 2025
作者: Nam V. Nguyen, Huy Nguyen, Quang Pham, Van Nguyen, Savitha Ramasamy, Nhat Ho
cs.AI

Abstract

Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process, in which the experts that perform the computation do not directly contribute to the routing decision. In this work, we propose competition, a novel mechanism to route tokens to the experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router that learns the competition policy, thus achieving strong performance at a low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
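To make the routing idea concrete, the following is a minimal PyTorch sketch of competition-based routing as described in the abstract, not the official CompeteSMoE implementation (see the linked repository for that). It assumes an expert's "neural response" can be measured by the norm of its output for each token, and that a learned router is trained elsewhere to imitate the competition outcome so that all experts need not be evaluated at inference time; the class name, layer sizes, and scoring rule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompetitionRoutedMoE(nn.Module):
    """Sketch of competition-based routing for an SMoE layer.

    Assumption: the "neural response" of an expert is approximated by the
    L2 norm of its output per token. The linear router is intended to be
    trained to imitate the competition scores (e.g. with an auxiliary loss),
    which is only indicated in a comment below.
    """

    def __init__(self, dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.router = nn.Linear(dim, num_experts)  # learns the competition policy
        self.top_k = top_k

    def forward(self, x: torch.Tensor, use_competition: bool = True) -> torch.Tensor:
        # x: (num_tokens, dim)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, dim)

        if use_competition:
            # Competition: each expert is scored by the strength of its own response.
            scores = expert_outs.norm(dim=-1)        # (tokens, E)
        else:
            # Standard routing: scores come from the router alone.
            scores = self.router(x)                  # (tokens, E)

        weights = F.softmax(scores, dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)          # keep the k winners
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)           # renormalize

        gathered = expert_outs.gather(
            1, topk_idx.unsqueeze(-1).expand(-1, -1, x.size(-1))
        )                                                             # (tokens, k, dim)
        out = (topk_w.unsqueeze(-1) * gathered).sum(dim=1)            # (tokens, dim)

        # Auxiliary objective (not shown): train the router to match the
        # competition scores, e.g. F.mse_loss(self.router(x), scores.detach()),
        # so that inference can skip evaluating every expert.
        return out
```

A quick usage check under these assumptions: `CompetitionRoutedMoE(dim=64, num_experts=8)(torch.randn(16, 64))` returns a `(16, 64)` tensor, with `use_competition=False` falling back to router-only scoring for comparison.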
