NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints

Abstract

Compositional training has been the de facto paradigm for existing Multimodal Large Language Models (MLLMs): a pre-trained vision encoder is connected to a pre-trained LLM through continuous multimodal pre-training. However, because the components are trained separately, the multimodal scaling properties of this paradigm remain difficult to explore. In this paper, we focus on native, end-to-end training of MLLMs and systematically study its design space and scaling properties under a practical setting, namely data constraints. Through a careful study of various design choices in MLLMs, we obtain the meta-architecture that best balances performance and training cost. We then further explore the scaling properties of native MLLMs and show a positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. In addition, our findings and results provide in-depth insights for future studies of native MLLMs.
