MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
September 30, 2024
Authors: Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang
cs.AI
Abstract
We present MM1.5, a new family of multimodal large language models (MLLMs)
designed to enhance capabilities in text-rich image understanding, visual
referring and grounding, and multi-image reasoning. Building upon the MM1
architecture, MM1.5 adopts a data-centric approach to model training,
systematically exploring the impact of diverse data mixtures across the entire
model training lifecycle. This includes high-quality OCR data and synthetic
captions for continual pre-training, as well as an optimized visual
instruction-tuning data mixture for supervised fine-tuning. Our models range
from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE)
variants, and demonstrate that careful data curation and training strategies
can yield strong performance even at small scales (1B and 3B). Additionally, we
introduce two specialized variants: MM1.5-Video, designed for video
understanding, and MM1.5-UI, tailored for mobile UI understanding. Through
extensive empirical studies and ablations, we provide detailed insights into
the training processes and decisions that inform our final designs, offering
valuable guidance for future research in MLLM development.
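
To make the data-centric recipe described above concrete, here is a minimal sketch of the two post-MM1 training stages the abstract mentions: continual pre-training on high-quality OCR data and synthetic captions, followed by supervised fine-tuning on an optimized visual instruction-tuning mixture. This is not the authors' code; the stage names, field layout, and mixture weights are illustrative assumptions only.

```python
# Hypothetical illustration of the MM1.5 training pipeline from the abstract.
# Data-source names and sampling weights are placeholders, not paper values.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TrainingStage:
    name: str
    # Maps each data source to an assumed sampling weight within the mixture.
    data_mixture: dict[str, float] = field(default_factory=dict)


def mm15_recipe() -> list[TrainingStage]:
    """Return the two stages applied on top of the MM1 pre-trained model."""
    continual_pretrain = TrainingStage(
        name="continual_pre_training",
        data_mixture={
            "ocr_documents": 0.5,        # high-quality OCR data (weight is illustrative)
            "synthetic_captions": 0.5,   # synthetic caption data (weight is illustrative)
        },
    )
    supervised_finetune = TrainingStage(
        name="supervised_fine_tuning",
        data_mixture={"visual_instruction_tuning_mix": 1.0},
    )
    return [continual_pretrain, supervised_finetune]


if __name__ == "__main__":
    for stage in mm15_recipe():
        print(stage.name, stage.data_mixture)
```

The same stage list would apply across the model family the abstract describes (1B to 30B parameters, dense and MoE variants); only the mixture contents and weights would be tuned per the paper's ablations.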