

CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition

May 19, 2025
作者: Nam V. Nguyen, Huy Nguyen, Quang Pham, Van Nguyen, Savitha Ramasamy, Nhat Ho
cs.AI

Abstract

Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond the means of simply increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process, in which the experts that perform the computation do not directly contribute to the routing decision. In this work, we propose competition, a novel mechanism that routes tokens to the experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm that trains large language models by deploying a router to learn the competition policy, thereby achieving strong performance at low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
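The following is a minimal PyTorch sketch of the competition idea described in the abstract, not the authors' implementation: it assumes the "neural response" of an expert is measured as the norm of that expert's output for a token, and that an auxiliary loss trains the router to imitate the competition outcome so that routing can later rely on the router alone. All class and variable names (e.g., CompetitionSMoELayer) are hypothetical.

```python
# Hypothetical sketch of competition-based SMoE routing (not the official code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompetitionSMoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.router = nn.Linear(dim, num_experts)  # learns the competition policy
        self.top_k = top_k

    def forward(self, x: torch.Tensor, use_competition: bool = True):
        # x: (num_tokens, dim)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (T, E, D)
        router_logits = self.router(x)                                  # (T, E)

        if use_competition:
            # Competition: the experts' own responses (here, output norms) decide routing.
            scores = expert_outs.norm(dim=-1)                           # (T, E)
        else:
            # Conventional routing: the router alone decides.
            scores = router_logits

        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)         # (T, k)
        weights = F.softmax(topk_scores, dim=-1)                        # (T, k)

        # Combine the selected experts' outputs.
        selected = expert_outs.gather(
            1, topk_idx.unsqueeze(-1).expand(-1, -1, expert_outs.size(-1))
        )                                                               # (T, k, D)
        y = (weights.unsqueeze(-1) * selected).sum(dim=1)               # (T, D)

        # Auxiliary loss so the router learns to mimic the competition outcome.
        router_loss = F.kl_div(
            F.log_softmax(router_logits, dim=-1),
            F.softmax(scores.detach(), dim=-1),
            reduction="batchmean",
        )
        return y, router_loss
```

Note that in this sketch the competition scores require activating every expert, so the learned router is what would keep the routing overhead low once it has been trained to match the competition policy.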
