MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Abstract

We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. Each group processes only the tokens of its designated modality, while learned routing within the group preserves semantically informed adaptivity. Our empirical results show that this modality-specific parameter allocation yields substantial pre-training efficiency gains. Under a 1-trillion-token training budget, the MoMa 1.4B model, with 4 text experts and 4 image experts, achieves FLOPs savings of 3.7x overall (2.6x for text and 5.2x for image processing) compared to a compute-equivalent dense baseline, as measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference because of increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.
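The routing idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes expert-choice routing inside each modality group (each expert picks its highest-scoring tokens up to a fixed capacity), and the function name, the capacity value, and the random router scores are all hypothetical stand-ins for a learned router.

```python
import random


def route_modality_aware(modalities, scores, groups, capacity):
    """Assign tokens to experts, restricted by modality group.

    modalities: list with one modality label per token ("text" or "image").
    scores:     scores[t][e] is the router score of token t for expert e
                (hypothetical values; a real router would learn these).
    groups:     dict mapping a modality label to its list of expert ids.
    capacity:   assumed maximum number of tokens each expert may pick.
    Returns a dict mapping expert id -> sorted list of chosen token indices.
    """
    assignment = {}
    for modality, experts in groups.items():
        # Only tokens of this modality are visible to this expert group.
        token_ids = [t for t, m in enumerate(modalities) if m == modality]
        for e in experts:
            # Expert-choice step: the expert ranks its candidate tokens
            # by score and keeps the top `capacity` of them.
            ranked = sorted(token_ids, key=lambda t: scores[t][e], reverse=True)
            assignment[e] = sorted(ranked[:capacity])
    return assignment


random.seed(0)
modalities = ["text", "image", "image", "text", "image", "text"]
groups = {"text": [0, 1], "image": [2, 3]}  # two experts per modality
scores = [[random.random() for _ in range(4)] for _ in modalities]

assignment = route_modality_aware(modalities, scores, groups, capacity=2)
for expert_id, token_ids in sorted(assignment.items()):
    print(expert_id, token_ids)
```

Under expert-choice routing, a token may be selected by several experts in its group or by none, so per-expert load is fixed while per-token compute varies. The sketch only shows the assignment step; the expert computation and the combination of expert outputs are omitted.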
