Beyond Language Modeling: An Exploration of Multimodal Pretraining

Abstract

The visual world offers a critical axis for advancing foundation models beyond language. Despite growing interest in this direction, the design space for native multimodal models remains opaque. We provide empirical clarity through controlled, from-scratch pretraining experiments, isolating the factors that govern multimodal pretraining without interference from language pretraining. We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision, to train on diverse data including text, video, image-text pairs, and even action-conditioned video. Our experiments yield four key insights: (i) Representation Autoencoder (RAE) provides an optimal unified visual representation by excelling at both visual understanding and generation; (ii) visual and language data are complementary and yield synergy for downstream capabilities; (iii) unified multimodal pretraining leads naturally to world modeling, with capabilities emerging from general training; and (iv) Mixture-of-Experts (MoE) enables efficient and effective multimodal scaling while naturally inducing modality specialization. Through IsoFLOP analysis, we compute scaling laws for both modalities and uncover a scaling asymmetry: vision is significantly more data-hungry than language. We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language while accommodating the data-intensive nature of vision, paving the way for truly unified multimodal models.
