Redesigning Mixture-of-Experts Routers with Manifold Power Iteration

Abstract

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row condenses its expert matrix into a single representative vector, so that its dot product with a token better reflects token-expert affinity. However, there are no existing design principles that enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, MPI introduces a "Power-then-Retract" paradigm: a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint, ensuring both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of their associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters and confirm that this alignment yields more effective MoE models.
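
The "Power-then-Retract" idea can be illustrated with a minimal sketch of textbook power iteration followed by renormalization. This is not the paper's exact update rule. It assumes the expert is a single weight matrix W of shape (d_model, d_hidden), that the router row lives in the d_model input space, and that the retraction is a projection onto the unit sphere. The helper names (`matvec`, `transpose`, `retract`, `power_then_retract`) and the toy matrix are hypothetical and chosen for illustration only.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def retract(vector):
    """Assumed retraction: rescale the vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def power_then_retract(router_row, expert):
    """One sketch step: r <- retract(W W^T r).

    Repeating this step moves r toward the principal left singular
    direction of W, provided r is not orthogonal to that direction.
    """
    hidden = matvec(transpose(expert), router_row)  # W^T r
    powered = matvec(expert, hidden)                # W (W^T r)
    return retract(powered)


# Toy expert (d_model=3, d_hidden=2) with singular values 3 and 1.
# Its principal left singular direction is the first coordinate axis.
expert = [
    [3.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
]

row = retract([1.0, 1.0, 1.0])
for _ in range(20):
    row = power_then_retract(row, expert)

print(row)  # the first component dominates; the others shrink toward zero
```

In this toy case the second component shrinks by a factor of 9 (the squared ratio of the singular values) relative to the first on every step, which is why a few iterations suffice. In a training setting one would presumably interleave a single such step with each optimizer update rather than iterate to convergence, but that scheduling is an assumption here.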