Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

Abstract

Mixture-of-Experts (MoE) models lack explicit constraints to ensure that the router's decisions align with the experts' capabilities, which ultimately limits model performance. To address this, we propose the expert-router coupling (ERC) loss, a lightweight auxiliary loss that tightly couples the router's decisions with expert capabilities. Our approach treats each expert's router embedding as a proxy token for the tokens assigned to that expert, and feeds perturbed router embeddings through the experts to obtain internal activations. The ERC loss enforces two constraints on these activations: (1) each expert must exhibit higher activation for its own proxy token than for the proxy token of any other expert; (2) each proxy token must elicit stronger activation from its corresponding expert than from any other expert. Together, these constraints ensure that each router embedding faithfully represents its corresponding expert's capability, while each expert specializes in processing the tokens actually routed to it. The ERC loss is computationally efficient: it operates on only n^2 activations, where n is the number of experts. This is a fixed cost independent of batch size, unlike prior coupling methods, whose cost scales with the number of tokens (often millions per batch). Through pre-training MoE-LLMs ranging from 3B to 15B parameters and extensive analysis on trillions of tokens, we demonstrate the effectiveness of the ERC loss. Moreover, the ERC loss offers flexible control over, and quantitative tracking of, expert specialization levels during training, providing valuable insights into MoE models.
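The two constraints can be illustrated with a minimal sketch in plain Python. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the activation is measured as the L2 norm of a single linear projection per expert, the perturbation is multiplicative uniform noise, the constraints are enforced with hinge penalties, and the names `erc_loss`, `activation_norm`, `alpha`, and `noise` are hypothetical. What the sketch does share with the description above is that it only touches an n x n matrix of activations, regardless of how many tokens are in a batch.

```python
import math
import random


def activation_norm(w_in, x):
    """L2 norm of the hidden activation w_in @ x (assumed activation measure)."""
    return math.sqrt(sum(sum(w * xi for w, xi in zip(row, x)) ** 2 for row in w_in))


def erc_loss(router_emb, expert_w_in, alpha=1.0, noise=0.0, rng=None):
    """Hinge-style coupling penalty over the n x n proxy-activation matrix.

    router_emb[i]  : router embedding of expert i (list of d floats)
    expert_w_in[j] : input projection of expert j (h rows of d floats)
    alpha, noise   : hypothetical margin factor and perturbation scale
    """
    rng = rng or random.Random(0)
    n = len(router_emb)
    # Proxy tokens: perturbed router embeddings (assumed multiplicative noise).
    proxies = [[v * (1.0 + rng.uniform(-noise, noise)) for v in emb]
               for emb in router_emb]
    # act[i][j]: activation of expert j on the proxy token of expert i.
    act = [[activation_norm(expert_w_in[j], proxies[i]) for j in range(n)]
           for i in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Constraint (2): proxy i should activate expert i more than expert j.
            total += max(act[i][j] - alpha * act[i][i], 0.0)
            # Constraint (1): expert i should respond more to proxy i than to proxy j.
            total += max(act[j][i] - alpha * act[i][i], 0.0)
    return total / (n * n)


# Toy case with two experts, each with a single hidden unit.
router = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[[1.0, 0.0]], [[0.0, 1.0]]]   # each expert responds to its own proxy
swapped = [[[0.0, 1.0]], [[1.0, 0.0]]]   # each expert responds to the other's proxy
print(erc_loss(router, aligned))  # 0.0
print(erc_loss(router, swapped))  # 1.0
```

In this toy setting the penalty is zero when every expert responds most strongly to its own proxy token and positive when router embeddings and experts are mismatched, which is the behaviour the two constraints describe.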
