

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

September 18, 2023
Authors: Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
cs.AI

Abstract

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs use models with 13B parameters or fewer. In this paper, we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from explorations in image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that the performance of LoRA/QLoRA tuning of LMMs is comparable to that of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and of mixing multimodal and language-only data to improve LMM performance, and shows that visual instruction tuning can sometimes improve an LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.
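To make the parameter-efficient training methods mentioned above concrete, below is a minimal, hypothetical sketch of LoRA/QLoRA-style fine-tuning using the Hugging Face `transformers` and `peft` libraries. This is not the authors' LLaVA training code; the backbone model name, rank, and target modules are illustrative assumptions only.

```python
# Sketch only: LoRA adapters on a 4-bit quantized base model (the QLoRA recipe).
# Model name and hyperparameters are assumptions, not the paper's exact settings.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization for the frozen base weights (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # assumed 70B backbone, matching the paper's largest scale
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA: train small low-rank adapter matrices on the attention projections
# instead of updating the full 70B parameters.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```

The appeal of this setup, and a likely reason the paper compares it against full-model fine-tuning, is that only the adapter weights receive gradients, so the memory and compute cost of instruction tuning a 65B/70B backbone drops dramatically.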