MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
July 31, 2024
Authors: Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan
cs.AI
Abstract
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE)
architecture designed for pre-training mixed-modal, early-fusion language
models. MoMa processes images and text in arbitrary sequences by dividing
expert modules into modality-specific groups. These groups exclusively process
designated tokens while employing learned routing within each group to maintain
semantically informed adaptivity. Our empirical results reveal substantial
pre-training efficiency gains through this modality-specific parameter
allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model,
featuring 4 text experts and 4 image experts, achieves impressive FLOPs
savings: 3.7x overall, with 2.6x for text and 5.2x for image processing
compared to a compute-equivalent dense baseline, measured by pre-training loss.
This outperforms the standard expert-choice MoE with 8 mixed-modal experts,
which achieves 3x overall FLOPs savings (3x for text, 2.8x for image).
Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs
savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination
hurts performance in causal inference due to increased sensitivity to router
accuracy. These results demonstrate MoMa's potential to significantly advance
the efficiency of mixed-modal, early-fusion language model pre-training, paving
the way for more resource-efficient and capable multimodal AI systems.
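To make the core idea concrete, below is a minimal PyTorch sketch of a modality-aware MoE layer, not the authors' implementation: tokens are split by modality and each modality-specific expert group applies its own learned, expert-choice-style routing. The class names (`ExpertGroup`, `ModalityAwareMoELayer`), the `capacity_factor` heuristic, and the 4x hidden expansion are illustrative assumptions; details such as load balancing and the auxiliary routing needed for causal inference are omitted.

```python
import torch
import torch.nn as nn


class ExpertGroup(nn.Module):
    """A group of feed-forward experts with expert-choice-style routing:
    each expert selects the tokens it scores highest, so per-expert compute
    stays fixed regardless of how tokens are distributed (illustrative sketch)."""

    def __init__(self, dim: int, num_experts: int, capacity_factor: float = 1.0):
        super().__init__()
        self.num_experts = num_experts
        self.capacity_factor = capacity_factor
        self.router = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim) -- tokens belonging to a single modality
        num_tokens = x.shape[0]
        capacity = max(1, int(self.capacity_factor * num_tokens / self.num_experts))
        scores = self.router(x).softmax(dim=0)  # (num_tokens, num_experts)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Each expert picks its top-`capacity` tokens by router score.
            weights, idx = scores[:, e].topk(min(capacity, num_tokens))
            out[idx] += weights.unsqueeze(-1) * expert(x[idx])
        return out


class ModalityAwareMoELayer(nn.Module):
    """Modality-aware MoE layer: text and image tokens are dispatched to
    separate expert groups, each with its own learned router."""

    def __init__(self, dim: int, num_text_experts: int = 4, num_image_experts: int = 4):
        super().__init__()
        self.text_experts = ExpertGroup(dim, num_text_experts)
        self.image_experts = ExpertGroup(dim, num_image_experts)

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim); is_image: (num_tokens,) boolean modality mask
        out = torch.empty_like(x)
        text_mask = ~is_image
        if text_mask.any():
            out[text_mask] = self.text_experts(x[text_mask])
        if is_image.any():
            out[is_image] = self.image_experts(x[is_image])
        return out


# Example: a mixed-modal sequence of 16 tokens, 6 of which are image tokens.
if __name__ == "__main__":
    layer = ModalityAwareMoELayer(dim=64)
    tokens = torch.randn(16, 64)
    is_image = torch.zeros(16, dtype=torch.bool)
    is_image[4:10] = True
    print(layer(tokens, is_image).shape)  # torch.Size([16, 64])
```

Because routing happens within each modality group, the 4-text-expert / 4-image-expert configuration described in the abstract keeps parameter allocation modality-specific while still adapting to token content via the learned routers.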