CompeteSMoE -- Statistically Guaranteed Mixture of Experts Training via Competition

Summary

Sparse mixture of experts (SMoE) offers an attractive way to increase model complexity without increasing the depth or width of the network. We argue, however, that effective SMoE training remains challenging because of the suboptimal routing process, in which the experts that perform the computation do not contribute directly to routing. In this work we introduce *competition*, a new mechanism that routes tokens to the experts with the highest neural response. Theoretically, we show that the *competition* mechanism has better sample efficiency than traditional softmax routing. In addition, we develop CompeteSMoE, a simple yet effective algorithm for training large language models that deploys a router to learn the *competition* policy, so that it delivers strong performance at a low training cost. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the effectiveness, robustness, and scalability of CompeteSMoE compared with state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the earlier study at arXiv:2402.02526.

English

Sparse mixture of experts (SMoE) offers an appealing solution to scale up the model complexity beyond the means of increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of the suboptimal routing process, where experts that perform computation do not directly contribute to the routing process. In this work, we propose competition, a novel mechanism to route tokens to experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys a better sample efficiency than the traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm to train large language models by deploying a router to learn the competition policy, thus enjoying strong performance at a low training overhead. Our extensive empirical evaluations on both the visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We have made the implementation available at: https://github.com/Fsoft-AIC/CompeteSMoE. This work is an improved version of the previous study at arXiv:2402.02526.
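
The idea of routing a token to the experts with the highest neural response can be illustrated with a small toy sketch. This is not the CompeteSMoE implementation from the repository: the experts here are hypothetical linear maps, the "neural response" is assumed to be the L2 norm of each expert's output, and the names `expert_output` and `competition_route` are made up for illustration.

```python
import math

def expert_output(weights, token):
    """Apply one toy linear expert (a matrix given as a list of rows) to a token."""
    return [sum(w * x for w, x in zip(row, token)) for row in weights]

def competition_route(experts, token, k=2):
    """Select the k experts with the largest response and mix their outputs."""
    # Every expert processes the token, so the experts themselves decide the routing.
    outputs = [expert_output(w, token) for w in experts]
    # Assumption: the response is scored as the L2 norm of the expert output.
    scores = [math.sqrt(sum(v * v for v in out)) for out in outputs]
    winners = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the winners' scores (shifted by the max for numerical stability).
    top = max(scores[i] for i in winners)
    exps = {i: math.exp(scores[i] - top) for i in winners}
    total = sum(exps.values())
    gates = {i: exps[i] / total for i in winners}
    # Gate-weighted sum of the winning experts' outputs.
    dim = len(outputs[0])
    mixed = [sum(gates[i] * outputs[i][d] for i in winners) for d in range(dim)]
    return winners, gates, mixed

# Three hypothetical experts that scale the input by 1, 3, and 0.1.
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 3.0]],
    [[0.1, 0.0], [0.0, 0.1]],
]
token = [1.0, 2.0]
winners, gates, mixed = competition_route(experts, token, k=2)
print(winners, gates, mixed)
```

Scoring this way requires running every expert on every token, which removes the computational benefit of sparsity. That cost is the reason the abstract describes a router trained to learn the competition policy, so that the full competition does not have to be run for every token.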
