MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
July 31, 2024
Authors: Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan
cs.AI
Abstract
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE)
architecture designed for pre-training mixed-modal, early-fusion language
models. MoMa processes images and text in arbitrary sequences by dividing
expert modules into modality-specific groups. These groups exclusively process
designated tokens while employing learned routing within each group to maintain
semantically informed adaptivity. Our empirical results reveal substantial
pre-training efficiency gains through this modality-specific parameter
allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model,
featuring 4 text experts and 4 image experts, achieves impressive FLOPs
savings: 3.7x overall, with 2.6x for text and 5.2x for image processing
compared to a compute-equivalent dense baseline, measured by pre-training loss.
This outperforms the standard expert-choice MoE with 8 mixed-modal experts,
which achieves 3x overall FLOPs savings (3x for text, 2.8x for image).
Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs
savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination
hurts performance in causal inference due to increased sensitivity to router
accuracy. These results demonstrate MoMa's potential to significantly advance
the efficiency of mixed-modal, early-fusion language model pre-training, paving
the way for more resource-efficient and capable multimodal AI systems.
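To make the routing scheme concrete, below is a minimal PyTorch sketch (not the paper's implementation) of a modality-aware MoE layer: tokens are partitioned by modality, and each modality's tokens are routed only within that modality's expert group using expert-choice routing, where each expert selects its top-k tokens. Names such as `ModalityAwareMoE`, `ExpertGroup`, `d_model`, and `capacity_factor` are illustrative assumptions; the 4-text/4-image expert configuration mirrors the MoMa 1.4B setup described in the abstract.

```python
# Minimal sketch (assumed structure, not the authors' code) of a modality-aware
# MoE layer: tokens are split by modality, and each modality's tokens are routed
# only within that modality's expert group via expert-choice routing.
import torch
import torch.nn as nn


class ExpertGroup(nn.Module):
    """One group of feed-forward experts with expert-choice routing."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int, capacity_factor: float = 1.0):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.capacity_factor = capacity_factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model) -- tokens of a single modality, flattened.
        n_tokens = x.size(0)
        if n_tokens == 0:
            return x
        n_experts = len(self.experts)
        # Expert-choice routing: each expert selects its own top-k tokens.
        k = max(1, int(self.capacity_factor * n_tokens / n_experts))
        scores = torch.softmax(self.router(x), dim=-1)                      # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.t().topk(min(k, n_tokens), dim=-1)   # (n_experts, k)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            chosen = topk_idx[e]  # token indices selected by expert e
            out[chosen] += topk_scores[e].unsqueeze(-1) * expert(x[chosen])
        return out


class ModalityAwareMoE(nn.Module):
    """Sends text tokens to text experts and image tokens to image experts."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_text_experts: int = 4, n_image_experts: int = 4):
        super().__init__()
        self.text_group = ExpertGroup(d_model, d_ff, n_text_experts)
        self.image_group = ExpertGroup(d_model, d_ff, n_image_experts)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); is_image: (n_tokens,) boolean modality mask.
        out = torch.empty_like(x)
        out[~is_image] = self.text_group(x[~is_image])
        out[is_image] = self.image_group(x[is_image])
        return out


if __name__ == "__main__":
    layer = ModalityAwareMoE()
    tokens = torch.randn(16, 512)
    modality_mask = torch.rand(16) > 0.5   # True = image token, False = text token
    print(layer(tokens, modality_mask).shape)  # torch.Size([16, 512])
```

In this sketch, expert-choice routing keeps per-expert load balanced by construction (each expert takes a fixed token budget), while the hard modality split realizes the modality-specific parameter allocation the abstract credits for the efficiency gains; the actual MoMa training setup may differ in capacity, normalization, and load-balancing details.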