

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

March 14, 2024
作者: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
cs.AI

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identify several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models of up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
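The abstract's key data lesson is that pre-training batches should be drawn from a weighted mix of image-caption, interleaved image-text, and text-only sources. A minimal sketch of such a mixture sampler is shown below; the mixture weights are illustrative placeholders, not the ratios reported in the paper.

```python
import random

# Hypothetical mixture weights for the three data types named in the
# abstract. These values are assumptions for illustration only; the
# paper's ablations determine the actual proportions.
MIXTURE_WEIGHTS = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def sample_data_source(rng: random.Random) -> str:
    """Pick which dataset the next pre-training batch is drawn from."""
    sources = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

# Sampling many batches: empirical frequencies approach the weights.
rng = random.Random(0)
counts = {s: 0 for s in MIXTURE_WEIGHTS}
for _ in range(10_000):
    counts[sample_data_source(rng)] += 1
```

In a real training loop, each sampled source would yield the next batch from the corresponding dataset, so the gradient signal reflects the chosen mixture over the course of pre-training.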

