Federation of Experts: Communication-Efficient Distributed Inference for Large Language Models

Abstract

Mixture of Experts (MoE) has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. In distributed settings, however, communicating token embeddings between experts is a significant bottleneck. We present Federation of Experts (FoE), a novel architecture that restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads, and expert parallelism is applied among the experts within that cluster. Between clusters, a sum synchronizes the post-attention residuals, and the synchronized result then drives routing and dispatch for the next MoE block. In a single-node setting, FoE eliminates all-to-all communication entirely, because all experts within a cluster reside on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, which significantly reduces communication overhead. On LongBench, an implementation of FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing end-to-end forward-pass latency by up to 5.2x, time to first token (TTFT) by 3.62x, and time between tokens (TBT) by 1.95x. It does so while achieving generation quality comparable to that of an MoE model of the same size and training configuration.
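
The data flow described above can be illustrated with a toy, single-process sketch. It is not the authors' implementation, and everything beyond the abstract is an assumption made for illustration: the names `sync_residual` and `Cluster`, the top-1 routing rule, the linear experts, and the choice to pass each cluster's post-attention contribution in as a ready-made vector. The abstract does not say how per-cluster expert outputs are recombined, so the sketch leaves them as per-cluster partial results.

```python
import random

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def rand_matrix(rng, rows, cols):
    """Random weights; stand-ins for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def sync_residual(residual, partials):
    """Sum per-cluster post-attention contributions into the residual.

    In a distributed run this sum would be a collective such as an
    all-reduce; here it is a plain loop (assumption for illustration).
    """
    out = list(residual)
    for p in partials:
        out = [o + pi for o, pi in zip(out, p)]
    return out

class Cluster:
    """Hypothetical MoE cluster: a local router plus local experts."""

    def __init__(self, rng, d_model, n_experts):
        self.router = rand_matrix(rng, n_experts, d_model)
        # Linear experts keep the sketch short; real experts are MLPs.
        self.experts = [rand_matrix(rng, d_model, d_model) for _ in range(n_experts)]

    def forward(self, h):
        # Top-1 routing among experts that all belong to this cluster, so
        # dispatch never leaves the cluster (no cross-cluster all-to-all).
        scores = matvec(self.router, h)
        expert_id = max(range(len(scores)), key=scores.__getitem__)
        return expert_id, matvec(self.experts[expert_id], h)

rng = random.Random(0)
n_clusters, d_model, n_experts = 4, 8, 3
clusters = [Cluster(rng, d_model, n_experts) for _ in range(n_clusters)]

# One token: a residual vector plus one post-attention contribution per
# cluster (one per KV head in the abstract's description).
residual = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]
partials = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(n_clusters)]

h = sync_residual(residual, partials)              # the only cross-cluster step
results = [cluster.forward(h) for cluster in clusters]
chosen = [expert_id for expert_id, _ in results]   # one local expert per cluster
```

In this sketch the only step that touches more than one cluster is `sync_residual`; routing and expert computation use cluster-local state only, which is the property the abstract credits for removing or confining all-to-all traffic.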