Scaling Laws for Native Multimodal Models

Abstract

Building general-purpose models that can effectively perceive the world through multimodal signals has been a long-standing goal. Current approaches integrate separately pre-trained components, for example by connecting a vision encoder to a large language model (LLM) and continuing multimodal training. While such approaches exhibit remarkable sample efficiency, it remains an open question whether these late-fusion architectures are inherently superior. In this work, we revisit the architectural design of native multimodal models (NMMs), those trained from the ground up on all modalities, and conduct an extensive scaling laws study spanning 457 trained models with different architectures and training mixtures. Our investigation reveals no inherent advantage of late-fusion architectures over early-fusion ones, which do not rely on image encoders. On the contrary, early fusion exhibits stronger performance at lower parameter counts, is more efficient to train, and is easier to deploy. Motivated by the strong performance of early-fusion architectures, we show that incorporating Mixture of Experts (MoEs) allows models to learn modality-specific weights, significantly enhancing performance.
