

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

September 18, 2023
Authors: Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen
cs.AI

Abstract

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs are performed using models with 13B parameters or fewer. In this paper, we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from explorations in image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that the performance of LoRA/QLoRA tuning of LMMs is comparable to that of full-model fine-tuning. The study also highlights the importance of higher image resolutions and mixing multimodal-language data for improving LMM performance, and shows that visual instruction tuning can sometimes improve an LMM's pure language capability. We hope this study makes state-of-the-art LMM research at larger scales more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.
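The parameter-efficient recipes mentioned in the abstract (LoRA/QLoRA) can be illustrated with the Hugging Face `peft` and `bitsandbytes` libraries. The sketch below is a minimal, generic QLoRA-style setup for a large base LLM; the model id, adapter rank, and other hyperparameters are illustrative placeholders and are not taken from the paper's LLaVA training code.

```python
# Minimal QLoRA-style fine-tuning sketch (hedged): 4-bit quantized base model
# plus low-rank adapters. Model id and hyperparameters are placeholders,
# not the paper's exact configuration; LLaVA additionally attaches a vision tower.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-70b-hf"  # placeholder base LLM

# Load the base model in 4-bit NF4 (QLoRA-style) so a 65B/70B model fits on fewer GPUs.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters to the attention projections; only these small
# adapter weights are updated during instruction tuning.
lora_config = LoraConfig(
    r=64,                      # illustrative rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```

The appeal of this setup, and the reason the paper compares it against full-model fine-tuning, is that only the adapter parameters (and optimizer states for them) need to be kept in full precision, which makes tuning 33B and 65B/70B variants tractable on modest GPU budgets.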