MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that, for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models with up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
