

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

March 14, 2024
Authors: Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Guoli Yin, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang
cs.AI

Abstract

In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder together with image resolution and the image token count has substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning, and multi-image reasoning, enabling few-shot chain-of-thought prompting.
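The abstract highlights that pre-training on a careful mix of image-caption, interleaved image-text, and text-only data was key to MM1's few-shot results. A minimal sketch of how such a weighted data mixture can be sampled during training is below; the mixing weights here are illustrative placeholders, not the paper's actual recipe, and `sample_data_type` is a hypothetical helper, not code from the paper.

```python
import random

# Illustrative mixing weights over the three pre-training data types
# named in the abstract (placeholder values, not MM1's exact recipe).
MIXTURE = {
    "image_caption": 0.45,
    "interleaved_image_text": 0.45,
    "text_only": 0.10,
}

def sample_data_type(rng: random.Random) -> str:
    """Pick which data source the next training example is drawn from,
    in proportion to its mixture weight."""
    r = rng.random()
    cumulative = 0.0
    for source, weight in MIXTURE.items():
        cumulative += weight
        if r < cumulative:
            return source
    return source  # guard against floating-point round-off

# Sampling many times recovers the mixture proportions.
rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_data_type(rng)] += 1
print(counts)
```

In a real pre-training loop each sampled source would feed a different tokenization path (e.g. image tokens from the vision encoder interleaved with text tokens), but the per-batch mixing decision reduces to this kind of weighted draw.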
