MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Abstract

We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. Each group processes only the tokens of its designated modality, while learned routing within the group maintains semantically informed adaptivity. Our empirical results show substantial pre-training efficiency gains from this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, with 4 text experts and 4 image experts, achieves FLOPs savings of 3.7x overall (2.6x for text and 5.2x for image processing) compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.
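To make the routing idea concrete, here is a minimal pure-Python sketch of modality-aware routing: tokens are first partitioned by modality, and each expert then selects tokens only from its own modality group. It is an illustration under stated assumptions, not the authors' implementation. The function name `route_modality_aware`, the `capacity` parameter, and the hard-coded score table are hypothetical; in a real model the scores would come from a learned router. Expert-choice selection within each group is also an assumption here, since the abstract names expert-choice only for the baseline MoE.

```python
def route_modality_aware(modalities, scores, experts_per_group, capacity):
    """Assign tokens to experts, restricting each token to its modality's group.

    modalities: list of "text" / "image" labels, one per token.
    scores: scores[t][e] is a hypothetical router score of token t for
            expert e of the token's own modality group.
    capacity: number of tokens each expert selects (assumed fixed here).
    Returns a dict mapping (modality, expert_index) to sorted token indices.
    """
    assignment = {}
    for modality in ("text", "image"):
        # Step 1: modality-aware partition. Only tokens of this modality
        # are visible to this group's experts.
        token_ids = [t for t, m in enumerate(modalities) if m == modality]
        for e in range(experts_per_group):
            # Step 2 (assumed expert-choice): each expert picks the
            # `capacity` tokens with the highest score for that expert.
            ranked = sorted(token_ids, key=lambda t: scores[t][e], reverse=True)
            assignment[(modality, e)] = sorted(ranked[:capacity])
    return assignment


# Toy interleaved sequence of six tokens with made-up router scores.
modalities = ["text", "image", "image", "text", "image", "text"]
scores = [
    [0.9, 0.1],
    [0.2, 0.8],
    [0.7, 0.3],
    [0.4, 0.6],
    [0.5, 0.5],
    [0.1, 0.2],
]
assignment = route_modality_aware(modalities, scores, experts_per_group=2, capacity=2)
for key in sorted(assignment):
    print(key, assignment[key])
```

In this sketch a token can be chosen by more than one expert in its group, or by none, which is a general property of expert-choice selection; no text token is ever routed to an image expert or vice versa.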
